Vol 456 | 13 November 2008 | doi:10.1038/nature07538

LETTERS

Associative learning of social value

Timothy E. J. Behrens1,2*, Laurence T. Hunt1,2*, Mark W. Woolrich1 & Matthew F. S. Rushworth1,2

Our decisions are guided by information learnt from our environment. This information may come via personal experiences of reward, but also from the behaviour of social partners1,2. Social learning is widely held to be distinct from other forms of learning in its mechanism and neural implementation; it is often assumed to compete with simpler mechanisms, such as reward-based associative learning, to drive behaviour3. Recently, neural signals have been observed during social exchange reminiscent of signals seen in studies of associative learning4. Here we demonstrate that social information may be acquired using the same associative processes assumed to underlie reward-based learning. We find that key computational variables for learning in the social and reward domains are processed in a similar fashion, but in parallel neural processing streams. Two neighbouring divisions of the anterior cingulate cortex were central to learning about social and reward-based information, and for determining the extent to which each source of information guides behaviour. When making a decision, however, the information learnt using these parallel streams was combined within ventromedial prefrontal cortex. These findings suggest that human social valuation can be realized by means of the same associative processes previously established for learning other, simpler, features of the environment.

To compare learning strategies for social and reward-based information, we constructed a task in which each outcome revealed information both about likely future outcomes (reward-based information) and about the trust that should be assigned to future advice from a confederate (social information). Twenty-four subjects performed a decision-making task requiring the combination of information from three sources (Fig. 1, Methods and Supplementary Information): (1) the reward magnitude of each option (generated randomly at each trial); (2) the likely correct response (blue or green) based on their own experience of rewards on each option; and (3) the confederate's advice, and how trustworthy the confederate currently was. When a new outcome was witnessed, subjects could use this single outcome to learn in parallel about the likely correct action, and the trustworthiness of the confederate. The investigation resembles previous experiments that have compared animate and inanimate conditions in different trials or experiments5,6. Here, however, both sources of information were present on each trial outcome but the relevance of each was manipulated continuously, allowing determination of both the functional magnetic resonance imaging (fMRI) signal and the behavioural influence associated with each source of information.

Optimal behaviour in this task requires the subject to track the probability of the correct action and the probability of correct advice independently, and to combine these two probabilities into an overall probability of the correct response (Supplementary Information). Computational models of reinforcement learning (RL) have had considerable success in predicting how such probabilities are tracked in learning tasks outside the social domain7. The simplest RL models

integrate information over trials by maintaining and updating the expected value of each option. When new information is observed this value is updated by the product of the prediction error and the learning rate7. In our task, there are two dissociable prediction errors: the reward prediction error (actual reward − expected value), for learning about the correct option, and the confederate prediction error (actual − expected fidelity), for learning about the trustworthiness of the confederate. The optimal learning rate depends on the volatility of the underlying information source8–10. In volatile conditions, subjects should give more weight to recent information, using a fast learning rate. In stable conditions, subjects should weigh recent and historical information almost equally, using a slow learning rate. By ensuring that the correct option and the confederate's advice became volatile at different times, we ensured that the learning rate for these two sources of information varied independently. We used a Bayesian RL8 model (Supplementary Information) to generate the optimal estimates of prediction error, volatility and outcome probability separately for each source of information (Fig. 1b–d).

We first sought to establish whether human behaviour matched predictions from the RL model. We used logistic regression to determine the degree to which subject choices were influenced by the optimally tracked confederate and outcome probabilities, and by the difference in reward magnitudes between options. Parameter estimates for all three information sources were significantly greater than zero, and there was no significant difference in the degree to which subjects used reward and social information to determine their behaviour (Fig. 1e). Furthermore there was no significant effect either of subjects blindly following confederate advice without learning its value, or of subjects assuming that the confederate would behave in the same way as the previous trial (Fig. 1e). Hence subjects were able to integrate the fidelity of the confederate over many trials in an RL-like fashion.

We then investigated whether the fMRI signal reflected the model's estimates of prediction error and volatility, for both social and reward information, when subjects witnessed new outcomes. In the reward domain, neural responses have been identified that encode these key parameters8,11–16. Dopamine neurons in the ventral tegmental area (VTA) code reward prediction errors12,13,17. Similar signals are reported in the dopaminoceptive striatum11,18 and even in the VTA itself, when specialized strategies are used in human fMRI studies19. fMRI correlates of the learning rate in the reward domain have been reported in anterior cingulate sulcus (ACCs)8. If humans can learn from social information in a similar fashion, it should be possible to detect signals that co-vary with the same computational parameters, but in the social domain.

We observed blood-oxygen-level-dependent (BOLD) correlates of the confederate prediction error in dorsomedial prefrontal cortex (DMPFC) in the vicinity of the paracingulate sulcus, right middle temporal gyrus (MTG), and in the right superior temporal sulcus at the temporoparietal junction (STS/TPJ) (Fig. 2a). Equivalent signals

1 FMRIB Centre, University of Oxford, John Radcliffe Hospital, Oxford OX3 9DU, UK. 2Department of Experimental Psychology, University of Oxford, South Parks Road, Oxford OX1 3UD, UK. *These authors contributed equally to this work.


[Figure 1 panels: a, task schematic showing the four trial phases: Cue, two choices presented with associated point scores (3–7 s); Suggest, confederate provides correct/incorrect answer (3–7 s); Response; Interval (3–7 s); Monitor, feedback showing the correct answer, revealing the fidelity of the confederate advice (3 s). b, c, true and estimated probabilities (of blue being correct, and of correct confederate advice) across 120 trials. d, estimated volatility (a.u.) of the reward history and of the confederate advice across trials. e, logistic regression weightings (a.u.) for RMD, RLO, RLC, BFC and CPT; asterisks mark significant factors.]
Figure 1 | Experimental task and behavioural findings. a, Experimental task (see Methods and Supplementary Information). Each trial consists of four phases. Subjects are presented with a decision (Cue), receive the advice (red square) of the confederate (Suggest) and respond using a button press (grey square). An ‘Interval’ period follows, before the correct outcome is revealed (Monitor). If the subject chooses correctly the red bar is incrementally increased by the number of points on the chosen option. b, c, Reward schedules for reward (b) and social (c) information. Dashed lines show the true probability of blue being correct (b) and the true probability of correct confederate advice (c). Each schedule underwent periods of stability and volatility. Solid lines show the model’s estimate of the probabilities. d, Optimal model estimates of the volatility of reward (green) and social (red) information. e, Logistic regression on subject behaviour. Factors included were the reward magnitude difference between options (RMD); the outcome probability derived from the model using reward outcomes (RLO); the outcome probability derived from the model using confederate advice (RLC); the possibility that the subjects would blindly follow the confederate without learning (BFC); and the possibility that subjects would assume the confederate would behave as in the previous trial (CPT). The logistic regression analysis revealed significant effects only on RMD, RLO and RLC (asterisks). Error bars show s.e.m.; a.u., arbitrary units.


were present in the left hemisphere at the same threshold, but did not pass the cluster extent criterion; similar effects were also found bilaterally in the cerebellum (Supplementary Information). Notably, these regions showed a pattern of activation similar to known dopaminergic activity in reward learning13, but for social information. Activity correlated with the probability of a confederate lie after the subject decision but before the outcome was revealed (a prediction signal). When the subjects observed the trial outcome, activity correlated negatively with this same probability, but positively with the event of a confederate lie (Fig. 2b). This signal reflects both components of a prediction error signal for social information: the outcome (lie or truth) minus the expectation (Fig. 2b). These signals cannot be influenced by reward prediction errors as the two types of prediction error were decorrelated in the task design. The presence of this prediction

error signal in the brain is a prerequisite for any theory of an RL-like strategy for social valuation.

We performed a similar analysis for prediction errors on reward information (reward minus expected reward). We found a significant effect of reward prediction error in the ventral striatum (Fig. 2c), the ventromedial prefrontal cortex, and anterior cingulate sulcus (see Supplementary Information). As in the social domain, we observed significant effects of all three elements of the reward prediction error (Fig. 2d; see Supplementary Information for discussion).

As previously demonstrated8, the volatility of action–outcome associations predicted BOLD signal in a circumscribed region of the ACCs (Fig. 3a). This effect varied across people such that those whose behaviour relied more on their own experiences (Supplementary Information) showed a greater volatility-related signal in this region (Fig. 3b). The volatility of confederate advice correlated with BOLD signal in a circumscribed region in the adjacent ACC gyrus (ACCg) (Fig. 3a). Subjects whose behaviour relied more on this advice showed greater signal change in this region (Fig. 3c). Notably, this double dissociation (reflected in a three-way interaction between area (ACCs versus ACCg), volatility type (social versus outcome) and degree of reliance on social (F1,20 = 7.145, P = 0.015) or experiential information (F1,20 = 5.379, P = 0.031) in an analysis of covariance) can be understood by reference to a dissociation in macaque monkeys. Selective lesions to ACCs but not ACCg impair reward-guided decision making in the reward domain20. In the social domain, male macaques will forego food to acquire information about other individuals21,22. Selective lesions to ACCg but not ACCs abolish this effect23. We found that BOLD signals in these two regions reflect the respective values of the same outcome for learning about the two different sources of information.

Learning about reward probability from vicarious and personal experiences recruits distinct neural systems, but subjects combine information across both sources when making decisions (Fig. 1e). A ventromedial portion of the prefrontal cortex (VMPFC) has been shown to code such an expected value signal for the chosen action24,25 during decision making. We computed two probabilities of reward on the subject's chosen option: one based only on experience and one based only on confederate advice. BOLD signal in the VMPFC was significantly correlated with both probabilities (Fig. 4a and Supplementary Fig. 4). However, there was subject variability in whether the VMPFC signal better reflected the reward probability based on outcome history or on social information. The extent to which the VMPFC data reflected each source of information (at the time of the decision) was predicted by the ACCs/ACCg response to outcome/social volatility (at the time when the outcomes were witnessed) (Fig. 4b, c).


[Figure 2 panels: a, social prediction error activations (sections at x = 0 mm and x = 54 mm); c, reward prediction error activation in ventral striatum (section at y = 16 mm); b, d, time courses of the (partial) correlation between BOLD signal and each regressor of interest across the trial (Trial onset, Suggestion, Response, Outcome; 0–25 s), showing the effects of the lie event and lie probability (b) and of reward magnitude and expected value (d).]

Figure 2 | Predictions and prediction errors in social and non-social domains. Time courses show (partial) correlations ± s.e.m. (See Supplementary Fig. 2.) a, Activations in the DMPFC, right TPJ/STS and MTG correlate with the social prediction error at the outcome (threshold set at Z > 3.1, cluster size > 50 voxels). b, Deconstruction of signal change in the DMPFC. Similar results were found in the MTG and TPJ/STS. Top: following the outcome, areas that encode prediction error correlate positively with the outcome and negatively with the predicted probability. Red, effect size of the confederate lie outcome (1 for lie, 0 for truth); blue, effect size of the predicted confederate lie probability. To perform inference, we fit a haemodynamic model in each subject to the time course of this effect (that is, to the blue line). The green line in the top panel shows the mean overall fit of this haemodynamic model (for comparison with the blue line). Bottom: the effect of lie probability (blue line from top panel) is decomposed into a haemodynamic response function at each trial event (corresponding to the four colours in the bottom panel) (see Supplementary Fig. 2). Dashed and solid lines show mean responses ± s.e.m. Each region showed a significant positive effect of predicted confederate lie probability after the decision (t22 = 1.96 (P < 0.05), 1.73 (P < 0.05), 1.74 (P < 0.05) for DMPFC, MTG and TPJ/STS, respectively). Crucially, each brain region showed a significant negative effect of predicted confederate lie probability after the outcome (t22 = 2.68 (P < 0.005), 2.35 (P < 0.05), 3.27 (P < 0.005)). c, Ventral striatum is taken as an example of a number of regions revealed by the voxel-wise analysis of reward prediction error (threshold set at Z > 3.1, cluster size > 100 voxels). d, Panels are exactly as in b, but coded in terms of reward and not in terms of confederate fidelity. The top panel shows the parameter estimate relating to the expected value of the trial (blue line) and, after the outcome, the parameter estimate relating to the magnitude of these rewards (red line). To test for prediction error coding, we again fit a haemodynamic model to the expectation parameter estimate (shown by the green line, for comparison with the blue line). Bottom panel: the time course showed a significant positive effect during the time of the decision (t22 = 3.32 (P < 0.002)), and a significant negative effect after the trial outcome (t22 = 2.50 (P < 0.05)). (See Supplementary Information for further discussion.)

Here, we have shown that the weighting assigned to social information is subject to learning and continual update via associative mechanisms. We use techniques that predict behaviour when learning from personal experiences to show that similar mechanisms explain behaviour in a social context. Furthermore, we demonstrate fundamental similarities between the neural encoding of key parameters for reward-based and social learning. Despite using similar mechanisms, distinct anatomical structures code learning parameters in the two domains. However, information from both is combined in ventromedial prefrontal cortex when making a decision. By comparing the two sources of information, we find that social prediction error signals similar to those reported in dopamine neurons for reward-based learning are coded in the MTG, STS/TPJ and DMPFC. BOLD signal fluctuations in these regions are often seen in social tasks26,27, and in tasks which involve the attribution of motive to stimuli28. Such activations have been thought critical in studies of the theory of mind28. That these regions should code quantitative prediction and prediction error signals about a confederate lends more weight to the argument that social evaluation mechanisms are able to rely on simple associative processes.

A second crucial parameter in reinforcement learning models is the learning rate, reflecting the value of each new piece of information. In the context of reward-based learning, this parameter predicts BOLD signal fluctuations in the ACCs at the crucial time for learning8—a finding that is replicated here. We further demonstrate that the exact same computational parameter, in the context of social learning, predicts BOLD fluctuations in the neighbouring ACCg. This functional dissociation is mirrored by differences in the regions' anatomical connectivity. In the macaque monkey, connections with motor regions lie predominantly in ACCs29, giving access to information about the monkey's own actions. Connections with visceral and social regions, including the STS, lie predominantly in ACCg29, giving access to information about other agents. Nevertheless, that it is the same computational parameter that is represented in ACCs and ACCg suggests that parallel streams of learning occur within ACC for social and non-social information.

It has been suggested that VMPFC activity might represent a common currency in which the value of different types of items might be encoded25,30. Here we show that the same portion of the VMPFC represents the expected value of a decision based on the combination


[Figure 3 and Figure 4 image panels: brain sections (x = –4 mm, y = 28 mm, y = 12 mm) and scatter plots relating the signal change elicited by reward history volatility in ACCs and by confederate advice volatility in ACCg (%) to the behavioural outcome/confederate weightings (a.u., Fig. 3b, c) and to the signal change elicited by the history-based and confederate advice-based probability of the chosen option in VMPFC during the decision (%, Fig. 4b, c).]

Figure 3 | Agency-specific learning rates dissociate in the ACC. a, Regions where the BOLD correlates of reward (green) and confederate (red) volatility predict the influence that each source of information has on subject behaviour (Z > 3.1, P < 0.05 cluster-corrected for cingulate cortex). b, Subjects with high BOLD signal changes in response to reward volatility in the ACCs are guided strongly by reward history information (maximum Z = 3.7, correlation R = 0.7163, P < 0.0001). c, Subjects with high BOLD signal changes in response to confederate advice volatility in the ACCg are guided strongly by social information (maximum Z = 4.1, correlation R = 0.7252, P < 0.0001). See Supplementary Information.

of information from social and experiential sources. However, the extent to which the VMPFC signal reflects each source of information during a decision is predicted by the extent to which the ACCs and ACCg modulate their activity at the point when information is learnt. If, as is suggested, the VMPFC response codes the expected value of a decision, then the ACCs response to each new outcome predicts the extent to which this outcome will determine future valuation of an action; the ACCg response predicts the extent to which this outcome will determine future valuation of an individual.


Figure 4 | Combination of expected value of chosen option in VMPFC. a, Activation for the combination (mean contrast) of experience-based probability during the Cue and Suggest phases, and advice-based probability during the Suggest phase (threshold set at Z > 3.1, P < 0.005 cluster-corrected for VMPFC). These phases represent the times at which subjects had these probabilities available to them (see Supplementary Fig. 4). b, Correlation between the effect of outcome-based probability in VMPFC during the decision and the effect of outcome volatility in ACCs during the Monitor phase (R = 0.7113, P < 0.0002). c, Correlation between the effect of confederate-based probability in VMPFC during the decision and the effect of confederate volatility in ACCg during the Monitor phase (R = 0.6119, P < 0.002). See Supplementary Information.

METHODS SUMMARY
Short description of task (Fig. 1a). Subjects performed a decision-making task while undergoing fMRI, repeatedly choosing between blue and green rectangles, each of which had a different reward magnitude available on each trial. The chance of the rewarded colour being blue or green depended on the recent outcome history. Before the experiment, subjects were introduced to a confederate. At each trial, the confederate would choose between supplying the subject with the correct or incorrect option, unaware of the number of points available. The subject's goal was to maximize the number of points gained during the experiment. In contrast, the confederate's goal was to ensure that the eventual score would lie within one of two pre-defined ranges, known to the confederate but not the subject. The confederate might therefore reasonably give consistently helpful or unhelpful advice, but this advice might change as the game progressed (Supplementary Information). During the experiment, the confederate was replaced by a computer that gave correct advice on a prescribed set of trials. Subjects knew that the trial outcomes were determined by an inanimate computer program, but believed that the social advice came from an animate agent's decision.
Full Methods and any associated references are available in the online version of the paper at www.nature.com/nature.
Received 8 September; accepted 14 October 2008.

1. Fehr, E. & Fischbacher, U. The nature of human altruism. Nature 425, 785–791 (2003).
2. Maynard Smith, J. Evolution and the Theory of Games (Cambridge Univ. Press, 1982).
3. Delgado, M. R., Frank, R. H. & Phelps, E. A. Perceptions of moral character modulate the neural systems of reward during the trust game. Nature Neurosci. 8, 1611–1618 (2005).
4. King-Casas, B. et al. Getting to know you: reputation and trust in a two-person economic exchange. Science 308, 78–83 (2005).
5. Rilling, J. et al. A neural basis for social cooperation. Neuron 35, 395–405 (2002).


6. Gallagher, H. L., Jack, A. I., Roepstorff, A. & Frith, C. D. Imaging the intentional stance in a competitive game. Neuroimage 16, 814–821 (2002).
7. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 1998).
8. Behrens, T. E., Woolrich, M. W., Walton, M. E. & Rushworth, M. F. Learning the value of information in an uncertain world. Nature Neurosci. 10, 1214–1221 (2007).
9. Courville, A. C., Daw, N. D. & Touretzky, D. S. Bayesian theories of conditioning in a changing world. Trends Cogn. Sci. 10, 294–300 (2006).
10. Dayan, P., Kakade, S. & Montague, P. R. Learning and selective attention. Nature Neurosci. 3 (Suppl.), 1218–1223 (2000).
11. O'Doherty, J. et al. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science 304, 452–454 (2004).
12. Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
13. Waelti, P., Dickinson, A. & Schultz, W. Dopamine responses comply with basic assumptions of formal learning theory. Nature 412, 43–48 (2001).
14. Matsumoto, M., Matsumoto, K., Abe, H. & Tanaka, K. Medial prefrontal cell activity signaling prediction errors of action values. Nature Neurosci. 10, 647–656 (2007).
15. Tanaka, S. C. et al. Prediction of immediate and future rewards differentially recruits cortico-basal ganglia loops. Nature Neurosci. 7, 887–893 (2004).
16. Daw, N. D., O'Doherty, J. P., Dayan, P., Seymour, B. & Dolan, R. J. Cortical substrates for exploratory decisions in humans. Nature 441, 876–879 (2006).
17. Bayer, H. M. & Glimcher, P. W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron 47, 129–141 (2005).
18. Haruno, M. & Kawato, M. Different neural correlates of reward expectation and reward expectation error in the putamen and caudate nucleus during stimulus-action-reward association learning. J. Neurophysiol. 95, 948–959 (2006).
19. D'Ardenne, K., McClure, S. M., Nystrom, L. E. & Cohen, J. D. BOLD responses reflecting dopaminergic signals in the human ventral tegmental area. Science 319, 1264–1267 (2008).
20. Kennerley, S. W., Walton, M. E., Behrens, T. E., Buckley, M. J. & Rushworth, M. F. Optimal decision making and the anterior cingulate cortex. Nature Neurosci. 9, 940–947 (2006).
21. Deaner, R. O., Khera, A. V. & Platt, M. L. Monkeys pay per view: adaptive valuation of social images by rhesus macaques. Curr. Biol. 15, 543–548 (2005).

22. Shepherd, S. V., Deaner, R. O. & Platt, M. L. Social status gates social attention in monkeys. Curr. Biol. 16, R119–R120 (2006).
23. Rudebeck, P. H., Buckley, M. J., Walton, M. E. & Rushworth, M. F. A role for the macaque anterior cingulate gyrus in social valuation. Science 313, 1310–1312 (2006).
24. O'Doherty, J. P. Reward representations and reward-related learning in the human brain: insights from neuroimaging. Curr. Opin. Neurobiol. 14, 769–776 (2004).
25. Kable, J. W. & Glimcher, P. W. The neural correlates of subjective value during intertemporal choice. Nature Neurosci. 10, 1625–1633 (2007).
26. Amodio, D. M. & Frith, C. D. Meeting of minds: the medial frontal cortex and social cognition. Nature Rev. Neurosci. 7, 268–277 (2006).
27. Allison, T., Puce, A. & McCarthy, G. Social perception from visual cues: role of the STS region. Trends Cogn. Sci. 4, 267–278 (2000).
28. Castelli, F., Frith, C., Happe, F. & Frith, U. Autism, Asperger syndrome and brain mechanisms for the attribution of mental states to animated shapes. Brain 125, 1839–1849 (2002).
29. Van Hoesen, G. W., Morecraft, R. J. & Vogt, B. A. in Neurobiology of Cingulate Cortex and Limbic Thalamus (eds Vogt, B. A. & Gabriel, M.) (Birkhäuser, 1993).
30. Plassmann, H., O'Doherty, J. & Rangel, A. Orbitofrontal cortex encodes willingness to pay in everyday economic transactions. J. Neurosci. 27, 9984–9988 (2007).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Acknowledgments We would like to acknowledge funding from the UK MRC (T.E.J.B., M.F.S.R.), the Wellcome Trust (L.T.H.) and the UK EPSRC (M.W.W.). We thank S. Knight for helping with data acquisition, and K. Watkins for help with figure preparation.

Author contributions All four authors contributed to generating the hypothesis and designing the experiment. Where specific roles can be assigned, L.T.H. collected the data, T.E.J.B. and L.T.H. analysed the data, T.E.J.B. and M.W.W. built the model, and M.F.S.R. supervised the project.

Author Information Reprints and permissions information is available at www.nature.com/reprints. Correspondence and requests for materials should be addressed to T.E.J.B. ([email protected]).


METHODS
Detailed analysis of the task, the learning model, the behavioural analysis, the data acquisition and pre-processing, and several further results and discussion can be found in the Supplementary Information. Here, we describe aspects of the fMRI modelling that may be relevant to the interpretation of our results. Further technical details can also be found in the Supplementary Information.
fMRI single-subject modelling. We performed two fMRI GLM analyses using FMRIB's Software library (FSL, ref. 31). The first looked for learning-related activity (Figs 2, 3 and Supplementary Fig. 3), the second for decision-related activity (Fig. 4 and Supplementary Fig. 4). In each case a general linear model was fit in pre-whitened data space (to account for autocorrelation in the fMRI residuals)32. Regressors were convolved and filtered according to FSL defaults (see Supplementary Information).
The following regressors (plus their temporal derivatives) were included in the time series model (learning-related activity): four regressors defining the different times during the task (see Fig. 1 and Supplementary Information), namely Cue, Suggest, Interval, Monitor; four regressors defining key learning parameters when the outcomes are presented (see Supplementary Information), namely (Monitor × Reward history volatility), (Monitor × Confederate volatility), (Monitor × Reward prediction error), (Monitor × Confederate prediction error).
The following regressors (plus their temporal derivatives) were included in the time series model (decision-related activity): four regressors defining the different times during the task (see Fig. 1 and Supplementary Information), namely Cue, Suggest, Interval, Monitor; seven regressors defining key decision parameters at the times when they were available during the decision (see Supplementary Information), namely (Cue × Experience-based probability), (Suggest × Experience-based probability), (Suggest × Confederate-based probability), (Cue × Chosen reward magnitude), (Suggest × Chosen reward magnitude), (Cue × Unchosen reward magnitude), (Suggest × Unchosen reward magnitude). Note that probabilities were log-transformed such that their linear combination in the GLM would approximate the optimal combination for behaviour (see Supplementary Information). Figure 4a was generated using the mean ([1 1 1]) contrast of all probability-related regressors.
fMRI group modelling. fMRI group analyses were carried out using a GLM with three regressors: a group mean, the weight for reward history information based on each subject's behaviour (see Supplementary Information), and the weight for

confederate information based on each subject's behaviour (see Supplementary Information).
fMRI region of interest analyses (Fig. 2). The following processing steps are illustrated schematically in Supplementary Fig. 2 and described in more detail in the Supplementary Information. Individual subject data were taken from regions of interest defined by the group clusters. Data from each trial were up-sampled and re-aligned to points in the trial corresponding to the onset of the four trial stages. Data were Z-normalized across trials at each time point in the trial. We then performed two general linear models across trials for both reward and confederate prediction errors. This allowed us (1) to test at which points in the trial the data correlated with the prediction of reward, or the prediction of confederate fidelity, and (2) to test at which points after the outcome the data correlated with the trial outcome, or actual confederate fidelity.
A prediction error signal should comprise three parts: (1) a positive correlation with the prediction after the decision; (2) a positive correlation with the trial outcome at the time of this outcome; (3) a negative correlation with the prediction at the time of the outcome (as a prediction error is defined as the outcome minus the prediction). We witnessed all three parts of the confederate prediction error as deflections in BOLD correlations at the relevant times. However, owing to the nature of the haemodynamic response, it is difficult to test significance from just these deflections. We therefore fit a haemodynamic model to these correlation profiles in each subject (see Supplementary Information). The key test was whether the time course of correlations with the prediction could be accounted for by a positive haemodynamic impulse at the time of the decision and a negative haemodynamic impulse at the time of the outcome; and whether the time course of correlations with the outcome could be accounted for by a positive haemodynamic impulse at the time of the outcome. By fitting the haemodynamic model we were able to measure a parameter estimate for each of these three haemodynamic impulses in each subject, and perform random-effects t-tests to measure the statistical significance of each.
31. Smith, S. M. et al. Advances in functional and structural MR image analysis and implementation as FSL. Neuroimage 23 (Suppl. 1), S208–S219 (2004).
32. Woolrich, M. W., Ripley, B. D., Brady, M. & Smith, S. M. Temporal autocorrelation in univariate linear modeling of FMRI data. Neuroimage 14, 1370–1386 (2001).


SUPPLEMENTARY INFORMATION

Experimental task and timings
24 human subjects (14M/10F, mean age 29, std 11 years and 9 months, age range 20-62, 4 left-handed) performed a decision-making task whilst undergoing fMRI, in which they repeatedly chose between blue and green rectangles in order to accumulate points. The point score (a random number between 1 and 100) associated with blue (fblue) and green (fgreen) was shown in the centre of each rectangle; this number was added to the subject's score if they chose the correct option. Subjects saw a red bar onscreen, whose length was proportional to their current score; they aimed to reach a silver target to win £10, or a gold target to win £20 (main figure 1a). Subjects were instructed that either blue or green would be correct on each trial, but that the probability of the two colours being correct was not equal – instead, the chance of each colour being correct depended upon the recent outcome history. Subjects were informed that the probabilities of each colour being correct were independent of the rewards available. Thus, as a result of the difference in reward magnitudes associated with the blue and green options, subjects often picked the less likely colour if it was associated with a higher reward. As the probability of green being correct (r) was always the inverse probability of blue being correct (1-r), subjects (and the model used to estimate volatility) needed only track one probability.
On each trial, 3-7 seconds after first seeing the stimuli (CUE phase), subjects received computer-generated advice about which rectangle to choose from a "human confederate", supposedly playing outside the scanner. This advice appeared for 3-7 seconds (ADVICE phase) before the subject was allowed to make their decision, and remained onscreen until an option was selected. After the subjects had made their


choice, there was a 3-7 second interval (INTERVAL phase) before the correct answer was revealed. The correct answer remained onscreen for 3 seconds (MONITOR phase), and was then replaced by a fixation point for 1 second before the next trial began. Subjects were introduced to an actor before the experiment began, and both subject and actor were taken through the experimental instructions and practised the task together. The confederate had two ‘ranges’ presented on their screen, gold and silver, which the subject would be unable to see during the experiment (fig. S1). In front of the subject, confederates were told that if the subject’s red bar ended the experiment within one of these ranges, the confederate would receive £20 (gold) or £10 (silver). On each trial, the confederate would be given two options: ‘Provide correct answer’ or ‘Provide incorrect answer’. When they made their choice, the correct or incorrect answer would be highlighted on the subject’s screen (‘ADVICE’, main fig. 1a). It was made clear that the confederate was not able to see whether blue/green was the correct answer, nor see the rewards available on each trial – their advice would therefore be independent of these other sources of information. Subjects were also told that the confederate was unable to see whether or not they took the advice, and so they could make use of consistently unhelpful advice by going against their confederate’s suggestions. The only feedback that the confederate would receive was an update, approximately every five trials, of how far advanced the subject’s red bar was, and how far through the experiment the subject had progressed. The ranges could be located anywhere along the length of this bar; they could be close together or far apart. Thus, as in fig. S1, situations could easily be designed in which a confederate might reasonably give unhelpful advice initially (to try to land the subject in the gold range by the end of the experiment), but change this advice as


the subject did better than expected, and the confederate’s motive changed (to try to land the subject in the silver range instead). Several examples of different situations were given in the initial instructions, to explicitly make clear to the subject that the confederate’s motives would depend upon the location of these ranges, and that these motives might change over time. As the subject was unable to see the two ranges, their only insight into the confederate’s current motive would be the reliability of the advice that they received on each trial. Subjects underwent 120 trials in total (main figure 1b,c). During the first 60 trials, the reward history was stable, with a 75% probability of blue being correct. During the next 60 trials, the reward history was volatile, switching between 80% green correct and 80% blue correct every 20 trials. Meanwhile, during the first 30 trials, the social advice given was stable, with 75% of suggestions being correct. During the next 40 trials, the social advice given was volatile, switching between 80% incorrect and 80% correct every 10 trials. During the final 50 trials, the advice given was stable again, with 85% of suggestions being incorrect. In order to counterbalance the design, eleven of the subjects had the advice inverted, such that the first 30 trials were 75% incorrect, and the last 50 trials were 85% correct. Hence, the dashed line in main fig 1(c) refers to the probability of true advice in half the subjects and the probability of false advice in the other half.
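For concreteness, these schedules can be written out as in the following toy Python reconstruction. Which sub-block of each volatile phase starts high rather than low is an assumption, the simulated outcomes are illustrative, and the advice schedule is shown for the non-inverted group only.

```python
import numpy as np

# Toy reconstruction of the two probability schedules described above (120 trials).
p_blue_correct = np.concatenate([
    np.full(60, 0.75),                        # trials 1-60: stable, 75% blue correct
    np.repeat([0.20, 0.80, 0.20], 20),        # trials 61-120: volatile, switch every 20 trials
])
p_advice_correct = np.concatenate([
    np.full(30, 0.75),                        # trials 1-30: stable, 75% of advice correct
    np.repeat([0.20, 0.80, 0.20, 0.80], 10),  # trials 31-70: volatile, switch every 10 trials
    np.full(50, 0.15),                        # trials 71-120: stable, 85% of advice incorrect
])
assert len(p_blue_correct) == len(p_advice_correct) == 120

rng = np.random.default_rng(0)
blue_correct = rng.random(120) < p_blue_correct       # simulated trial outcomes
advice_correct = rng.random(120) < p_advice_correct   # simulated confederate fidelity
```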

Fig. S1: Confederate task. ‘Confederates’ were instructed (in the presence of the subject) that their task was to use their advice to land the subject’s red bar within one of two ‘ranges’ by the end of the experiment. The ranges would be known to the confederate but not to the subject. It was important that the subject had a clear idea of


what the confederate would be doing in the task, as the subject did not know a computer would replace the confederate. It was particularly important that the subject believed that the confederate could be motivated either to give consistently good advice or consistently bad advice or to change their strategy at any point in the task. In the example shown here, the ranges are far apart, and so the confederate's motive might reasonably change as the subject did better or worse than expected. By telling the subject the structure of the goal that the confederate was aiming for, but not the actual location of the ranges, we were able to establish a situation in which the subject would be happy to believe that the confederate might tell the truth, lie or change between these alternatives, but also a situation in which the subject could not use cognitive strategies to work out how the confederate should be playing. The subject could therefore only predict the behaviour of the confederate by considering his previous advice.

Bayesian probability tracking
We used a previously published Bayesian model to generate optimal estimates of the probabilities of trial outcome and of confederate fidelity. This algorithm has been documented in detail elsewhere1, but we briefly describe its concept here. The model assumes that outcomes are generated with an underlying probability, r. The objective is to track r as it changes through time. The crucial question addressed by the model is how much the estimate of r should be updated when a new positive or negative outcome is observed. An unexpected event may be just chance or it may signal a change in the underlying reward probability. In order to know how much to update the estimate of r on witnessing a new outcome, it is crucial to know the rate of change of r. If r is changing fast on average, then an unlikely event is more likely to signify a big change in r, so an optimal learner should make a big update to its estimate. The Bayesian model therefore maintains an estimate of the expected rate of change of r, referred to as the volatility v. In a fast changing environment, the model will estimate a high volatility and therefore each new outcome will have a large influence on the optimal estimate of the reward rate. Conversely, in a slow changing environment, the


model will estimate a low volatility and each new outcome will have a negligible effect on the model’s estimate of r.
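As an illustration only, a heavily simplified grid-based approximation to this idea is sketched below in Python. The grid ranges, the beta-distribution transition kernel and the Gaussian random walk on log-volatility are assumptions made for the sketch and do not reproduce the published implementation1.

```python
import numpy as np
from scipy import stats

# Minimal sketch: joint grid posterior over the outcome probability r and the
# log-volatility v, updated one Bernoulli outcome at a time.
r_grid = np.linspace(0.01, 0.99, 50)                    # outcome probability r
v_grid = np.linspace(np.log(0.005), np.log(0.3), 30)    # log-volatility v (assumed range)

post = np.ones((len(v_grid), len(r_grid)))
post /= post.sum()                                      # flat initial posterior

def transition_matrix(v):
    """p(r_t | r_{t-1}, v): a beta distribution centred on r_{t-1} whose spread
    grows with the volatility exp(v) (an assumed parameterisation)."""
    conc = 1.0 / np.exp(v)
    T = np.empty((len(r_grid), len(r_grid)))
    for i, r_prev in enumerate(r_grid):
        a, b = r_prev * conc, (1.0 - r_prev) * conc
        row = stats.beta.pdf(r_grid, a, b) + 1e-12
        T[i] = row / row.sum()
    return T

T_per_v = [transition_matrix(v) for v in v_grid]

# small Gaussian random walk on v between trials (assumed)
v_kernel = stats.norm.pdf(v_grid[:, None] - v_grid[None, :], scale=0.1)
v_kernel /= v_kernel.sum(axis=0, keepdims=True)

def update(post, outcome):
    """One trial: diffuse r (and v), then weight by the Bernoulli likelihood."""
    pred = np.stack([post[j] @ T_per_v[j] for j in range(len(v_grid))])
    pred = v_kernel @ pred
    lik = r_grid if outcome == 1 else 1.0 - r_grid
    post = pred * lik[None, :]
    return post / post.sum()

outcomes = [1, 1, 0, 1, 1, 1, 0, 0, 0, 0]               # toy outcome sequence
for o in outcomes:
    post = update(post, o)
    r_hat = float((post.sum(axis=0) * r_grid).sum())    # posterior mean of r
    v_hat = float((post.sum(axis=1) * np.exp(v_grid)).sum())  # posterior mean volatility
    print(f"outcome={o}  r_hat={r_hat:.2f}  volatility={v_hat:.3f}")
```

In this sketch, a run of surprising outcomes pushes the posterior towards high volatility, so subsequent outcomes move the estimate of r more; this is the sense in which the volatility estimate sets an effective learning rate.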

Behavioural analysis
We performed a multivariate logistic regression to establish factors that predicted subject behaviour (main figure 1e). If subjects were performing the task optimally, they would learn a probability associated with blue rather than green being correct based on the history of outcomes. They would also learn a separate probability of the confederate giving correct advice based on the history of correct and incorrect advice at previous trials. When the confederate advice became visible at the current trial, subjects should then combine these probabilities to provide an overall probability that blue (and conversely green) would be the correct option. This probability should then be weighed with the respective reward magnitudes on each option to guide the final decision. Using the Bayesian reinforcement learning model described above1, we generated the optimal estimates of these probabilities based on the same observations witnessed by the subjects in the scanner (main figure 1b,c). If subjects were learning the probability of both the outcome and confederate advice according to such an associative strategy, these two factors should be key in predicting subject behaviour.
We also considered as factors two alternative strategies that might predict subject behaviour with respect to the confederate advice. First, subjects might blindly follow confederate advice without learning the probability that this advice would be good; and second, subjects might appreciate that the confederate may have a strategy of


giving bad advice, but subjects may fail to integrate this advice over a number of trials in an RL-like fashion, instead relying only on the confederate's most recent behaviour, analogous to common tit-for-tat models of social behaviour.

We therefore had five factors with which to predict subject choices (coded 1 for occasions when subjects chose blue and 0 for occasions when subjects chose green):
1) The difference in reward magnitudes on the two options (fblue-fgreen)
2) The RL probability that blue would be correct given the history of outcomes
3) The RL probability that blue would be correct given the advice of the confederate at the current trial and the history of correct confederate advice
4) The confederate advice at the current trial, ignoring the confederate's history
5) The confederate advice at the current trial interacted with the correctness of the confederate advice at the previous trial.

The 4th factor has the value 1 whenever the confederate advises blue, and 0 whenever the confederate advises green. The 5th factor has the value 1 when the confederate advises blue on the current trial, after giving correct advice at the previous trial, or when the confederate advises green, after giving incorrect advice at the previous trial; otherwise this factor has the value 0.
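For concreteness, the coding of these two factors might look like the following toy fragment; the value of the 5th factor on the first trial is set to 0 by assumption.

```python
# Hypothetical encoding of factors 4 (BFC) and 5 (CPT) from toy advice data.
advice_blue = [1, 0, 1, 1, 0]       # 1 if the confederate advises blue on that trial (toy)
advice_correct = [1, 1, 0, 1, 0]    # 1 if that trial's advice turned out to be correct (toy)

bfc = list(advice_blue)             # factor 4: follow the current advice, ignore history
cpt = [0] + [                       # factor 5: advice gated by the previous trial's fidelity
    advice_blue[t] if advice_correct[t - 1] else 1 - advice_blue[t]
    for t in range(1, len(advice_blue))
]
```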

The logistic regression analysis for each subject results in a parameter estimate for each factor, reflecting the extent to which that factor predicts subject choices. Results are presented in main figure 1e. Significant effects were also analysed in individual subjects (Z>2.3, p<0.01). This analysis shows that BFC and CPT were each


significant in 3/23 subjects, whereas RLO and RLC each showed significant effects in 14/23 subjects and RMD showed a significant effect in 19/23 subjects.
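A minimal sketch of the per-subject regression, with toy data standing in for the real design matrix, is given below (factor ordering as in the list above; choices are simulated from assumed weights only so that the fit runs stand-alone).

```python
import numpy as np
import statsmodels.api as sm

# Toy per-subject logistic regression with the five factors named in the text
# (RMD, RLO, RLC, BFC, CPT).  In the real analysis the columns come from the task
# and from the Bayesian model estimates, not from random numbers.
rng = np.random.default_rng(0)
n_trials = 120
X = np.column_stack([
    (rng.integers(1, 101, n_trials) - rng.integers(1, 101, n_trials)) / 100.0,  # RMD
    rng.uniform(0, 1, n_trials),    # RLO: P(blue correct | outcome history)
    rng.uniform(0, 1, n_trials),    # RLC: P(blue correct | advice and advice history)
    rng.integers(0, 2, n_trials),   # BFC
    rng.integers(0, 2, n_trials),   # CPT
])

# simulate choices from assumed 'true' weights so the fit has something to recover
true_beta = np.array([1.5, 2.0, 2.0, 0.0, 0.0])
p_blue = 1.0 / (1.0 + np.exp(-(X - X.mean(axis=0)) @ true_beta))
chose_blue = (rng.random(n_trials) < p_blue).astype(float)

fit = sm.Logit(chose_blue, sm.add_constant(X)).fit(disp=0)
print(fit.params)   # one parameter estimate per factor, as plotted in Fig. 1e
```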

FMRI data
FMRI data were acquired in 24 subjects on a 3T Siemens TRIO scanner. Data were excluded from one subject due to rapid head motion. The remaining 23 subjects were included in the analysis. FMRI data were acquired with a voxel resolution of 3x3x3 mm3, TR = 3 s, TE = 30 ms, flip angle = 87°. The slice angle was set to 15° and a local z-shim was applied around the orbitofrontal cortex to minimize signal dropout in this region2, which had been implicated in other aspects of decision-making in previous studies. The number of volumes acquired depended on the behaviour of the subject. The mean number of volumes was 943, giving a total experiment time of approximately 47 minutes. Stimulus presentation/subject button presses were registered and time-locked to FMRI data using Presentation (Neurobehavioural Systems, USA). Field maps were acquired using a dual echo 2D gradient echo sequence with echoes at 5.19 and 7.65 ms, and a repetition time of 444 ms. Data were acquired on a 64x64x40 grid, with a voxel resolution of 3 mm isotropic. T1-weighted structural images were acquired for subject alignment using an MPRAGE sequence with the following parameters: voxel resolution 1x1x1 mm3 on a 176x192x192 grid, echo time (TE) = 4.53 ms, inversion time (TI) = 900 ms, repetition time (TR) = 2200 ms.


FMRI analysis
FMRI analysis was carried out using FMRIB's Software library (FSL3).
Single subject processing
Preprocessing. Data were preprocessed using FSL default options: motion correction was applied using rigid body registration to the central volume4; Gaussian spatial smoothing was applied with a full width half maximum of 5mm; brain matter was segmented from non-brain using a mesh deformation approach5; high pass temporal filtering was applied using a Gaussian-weighted running lines filter, with a 3dB cutoff of 100s. Susceptibility-related distortions were corrected as far as possible using FSL field-map correction routines6.
Model estimation (learning-related activity). A general linear model was fit in pre-whitened data space (to account for autocorrelation in the FMRI residuals)7. The following regressors (plus their temporal derivatives) were included in the model:

1. CUE – times when options and reward values were presented onscreen, but not social advice;
2. ADVICE – times when options, reward values and social advice were all presented onscreen;
3. INTERVAL – times between making a response and the outcome being revealed;
4. MONITOR – times when the outcome of the trial was presented onscreen;
5. MONITOR x REWARD HISTORY VOLATILITY – monitor phase, modulated by the estimated volatility in the reward history on each trial;

www.nature.com/nature

8

doi: 10.1038/nature07538

SUPPLEMENTARY INFORMATION

6. MONITOR x CONFEDERATE ADVICE HISTORY VOLATILITY – monitor phase, modulated by the estimated volatility in the confederate advice history on each trial;
7. MONITOR x REWARD PREDICTION ERROR – monitor phase, modulated by the prediction error in the frame of reference of the reward;
8. MONITOR x CONFEDERATE PREDICTION ERROR – monitor phase, modulated by the prediction error in the frame of reference of the fidelity of the confederate advice.

These regressors were convolved with the FSL default haemodynamic response function (Gamma function, delay = 6 s, standard deviation = 3 s), and filtered by the same high pass filter as the data.
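As an illustration of this step (not the FSL implementation), a single parametrically modulated regressor could be constructed along the following lines; the onset times, prediction-error values and 0.1 s super-sampling grid are assumptions made for the sketch.

```python
import numpy as np
from scipy.stats import gamma

# Sketch of one modulated regressor (e.g. MONITOR x CONFEDERATE PREDICTION ERROR):
# a stick function at the outcome onsets, scaled by the trial-wise model estimate,
# convolved with a gamma HRF (mean 6 s, s.d. 3 s), then resampled to the TR grid.
TR, n_vols, dt = 3.0, 943, 0.1
t_hr = np.arange(0.0, n_vols * TR, dt)                        # high-resolution time grid (s)

shape, scale = 4.0, 1.5                                        # gamma with mean 6 s, s.d. 3 s
hrf = gamma.pdf(np.arange(0.0, 30.0, dt), a=shape, scale=scale)

rng = np.random.default_rng(1)
monitor_onsets = np.arange(20.0, n_vols * TR - 40.0, 25.0)     # toy outcome onsets (s)
pe_values = rng.standard_normal(len(monitor_onsets))           # toy prediction errors

stick = np.zeros_like(t_hr)
stick[np.round(monitor_onsets / dt).astype(int)] = pe_values   # modulated stick function

regressor = np.convolve(stick, hrf)[: len(t_hr)]               # convolve with the HRF
regressor_tr = regressor[:: int(round(TR / dt))][:n_vols]      # resample to the TR grid
# In the real model a temporal derivative is also added and the regressor is
# high-pass filtered with the same filter as the data.
```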

Model estimation (decision-related activity). A separate general linear model was fit in pre-whitened data space (to account for autocorrelation in the FMRI residuals)7. We computed two potential values of the subject's chosen option, each one based only on either social or non-social information (i.e. (i) the probability of a reward based only on experience and (ii) the probability of a reward based only on confederate advice). We used these two values, together with information about the reward magnitude, as regressors in our analysis. Information about reward magnitude and experience-based probability was available to subjects from the beginning of each trial (from the CUE phase onwards), whereas information about the confederate-based probability was only available to subjects once the suggestion had been presented (SUGGEST phase). Each regressor was therefore interacted with the time the information was available.


The following regressors (plus their temporal derivatives) were therefore included in the model (see main text):
1. CUE – times when options and reward values were presented onscreen, but not social advice;
2. ADVICE – times when options, reward values and social advice were all presented onscreen;
3. INTERVAL – times between making a response and the outcome being revealed;
4. MONITOR – times when the outcome of the trial was presented onscreen;
5. CUE x EXPERIENCE-BASED PROBABILITY – cue phase, modulated by the logarithm of the probability of the chosen action based on subjects' previous experience;
6. SUGGEST x EXPERIENCE-BASED PROBABILITY – suggest phase, modulated by the logarithm of the probability of the chosen action based on subjects' previous experience;
7. SUGGEST x CONFEDERATE ADVICE-BASED PROBABILITY – suggest phase, modulated by the logarithm of the probability of the chosen action based on current confederate advice and previous confederate fidelity;
8. CUE x CHOSEN REWARD MAGNITUDE – cue phase, modulated by the logarithm of the reward magnitude of the chosen action;
9. SUGGEST x CHOSEN REWARD MAGNITUDE – suggest phase, modulated by the logarithm of the reward magnitude of the chosen action;
10. CUE x UNCHOSEN REWARD MAGNITUDE – cue phase, modulated by the logarithm of the reward magnitude of the unchosen action;
11. SUGGEST x UNCHOSEN REWARD MAGNITUDE – suggest phase, modulated by the logarithm of the reward magnitude of the unchosen action.


In the case of the overall analysis of expected value (fig. 4a), a [1 1] contrast was performed across regressors 7 and 8 in the decision-related analysis shown above.

Note that in order to compute an overall probability the subjects must weigh the two sources of information (experience-based probability and advice-based probability) in a Bayesian fashion (see later). To a first-order approximation this constitutes a multiplication of the two probabilities. This overall probability should then be multiplied by the reward magnitude to obtain the Pascalian value of each option (see below). In order to linearise this problem for FMRI, we therefore entered as regressors the logarithm of these values. These regressors were convolved with the FSL default haemodynamic response function (Gamma function, delay = 6 s, standard deviation = 3 s), and filtered by the same high pass filter as the data.
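In other words, writing p_exp and p_adv for the experience-based and advice-based probabilities and f_chosen for the chosen reward magnitude (our notation, introduced only for this illustration), the linearisation amounts to

$$\log\left(p_{\mathrm{exp}}\, p_{\mathrm{adv}}\, f_{\mathrm{chosen}}\right) = \log p_{\mathrm{exp}} + \log p_{\mathrm{adv}} + \log f_{\mathrm{chosen}},$$

so that each logarithmic term can enter the GLM as a separate regressor with its own weight.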

Group data processing
Subjects were aligned to the MNI152 template using affine registration8. A general linear model was fit to the effects of the regressors described above9. This group GLM contained three factors:
1) A group mean.
2) The weight for reward history information based on each subject's behaviour (see below).
3) The weight for confederate information based on each subject's behaviour (see below).
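A minimal sketch of such a group design matrix, with toy behavioural weights, is given below; demeaning the two covariates is our assumption rather than a step stated in the text.

```python
import numpy as np

# Toy group-level design matrix: a mean column plus each subject's behavioural
# weights for reward-history and confederate information (values are illustrative).
n_subjects = 23
rng = np.random.default_rng(3)
w_reward = rng.normal(1.0, 0.3, n_subjects)     # per-subject RLO regression weights (toy)
w_confed = rng.normal(1.0, 0.3, n_subjects)     # per-subject RLC regression weights (toy)

group_design = np.column_stack([
    np.ones(n_subjects),                         # factor 1: group mean
    w_reward - w_reward.mean(),                  # factor 2: reward history weight (demeaned)
    w_confed - w_confed.mean(),                  # factor 3: confederate weight (demeaned)
])
```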


Inference
Volatility effects. Effects of volatility were hypothesized to be present in the anterior cingulate cortex (ACC) based on previous data1. We therefore performed cluster inference (Z>3.1) correcting for multiple comparisons at p<0.05 within a hand-drawn mask of the ACC. This required that there be more than 25 contiguous voxels.

Prediction Error effects. Prediction error effects are reported for clusters of greater than 50 contiguous voxels (for social prediction error) and greater than 100 contiguous voxels (for reward prediction error) at Z>3.5.

Expected Value effects. Effects of the two individual expected value signals (supplementary figure 4) are reported at Z>2.6 (p<0.01 uncorrected; p<0.05 cluster-corrected). Effects of the combination of the two probabilities (fig 4a) were hypothesized to be present in ventromedial prefrontal cortex (vmPFC) based on previous data10-12. We therefore performed cluster inference correcting for multiple comparisons at p<0.05 within a hand-drawn mask of vmPFC. This required that there be more than 21 contiguous voxels.

Using the above criteria, the following regions showed significant effects of social and reward prediction errors:

Social prediction error
Cluster size (voxels) | Max Z | MNI x (mm) | MNI y (mm) | MNI z (mm) | Location
216 | 4.11 | -26 | -72 | -36 | Left Cerebellum
198 | 4.73 | 2 | 54 | 16 | dmPFC
182 | 4.3 | -8 | 44 | 38 | dmPFC
111 | 3.98 | 32 | -60 | -48 | Right Cerebellum
84 | 3.81 | 54 | -30 | -16 | Right MTG
70 | 4.23 | 54 | -48 | 30 | Right STS/TPJ

Reward prediction errors
Cluster size (voxels) | Max Z | MNI x (mm) | MNI y (mm) | MNI z (mm) | Location
1444 | 5.33 | 8 | 14 | -10 | Ventral Striatum
841 | 4.74 | 52 | -68 | -14 | Extra-striate cortex
836 | 4.54 | 34 | -14 | 48 | Precentral gyrus
766 | 4.98 | 2 | 52 | -8 | vmPFC
313 | 4.61 | 6 | -28 | 48 | Posterior Cingulate sulcus
244 | 3.92 | 44 | -34 | 56 | Extra-striate cortex
197 | 4.52 | -6 | -80 | -16 | Striate cortex
189 | 4.21 | 16 | -76 | 56 | Dorsal parietal cortex
184 | 4.23 | -20 | -66 | -48 | Left cerebellum
121 | 4.61 | 14 | -48 | -54 | Right cerebellum

FMRI ROI analysis
Region of interest analyses were performed on activations reflecting the prediction errors on confederate and reward information. These analyses were performed in order to determine the nature of the BOLD signal fluctuations and their relationship to the expected fluctuations induced by prediction and prediction error signals. Supplementary figure 2 shows an outline of this analysis, which is described in detail below.


Supplementary Figure 2: Schematic of ROI analysis (described below). We took BOLD data in each subject from masks back-projected from each group prediction error region. We separated each subject's timeseries into each trial, and resampled each trial to a duration of 25s, such that the decision was presented at 0s, the confederate advice was presented at 5s, the response was given at 12s and the outcome was presented from 17s-20s. (These timings were the mean timings across all trials in all subjects.) The resampling resolution was 100ms. We then performed two separate GLMs across trials in each subject. The first GLM included a regressor for the prediction (the estimated probability of a confederate lie in figure 2b, and the expected value of the trial in figure 2d). The second GLM included regressors for the prediction and for the outcome (in figure 2b, the outcome was the event of a confederate lie (1 for lie and 0 for truth); in figure 2d the outcome was the reward itself). We then calculated the group average effect sizes (i.e. the mean of the effect across subjects) at each timepoint, and their standard errors. Data and regressors were Z-normalised so that effect sizes could be reported as (partial) correlations. The graphs in the top panels of main figures 2b,d therefore show a timeseries of effect sizes (partial correlations) for the prediction throughout the trial (blue) and for the outcome after the outcome period (red). In each case, a prediction signal should therefore show a positive effect in the blue curve before the outcome. A prediction error signal (outcome – expectation) should show a positive effect of the red curve and a negative effect of the blue curve after the outcome. In order to ascertain whether the signals were really prediction error signals, we performed a hemodynamic deconvolution of the effect of the prediction (blue line). These effects can be seen in the bottom panels of main figures 2b,d. We assumed that the trial could be modelled by hemodynamic response functions (hrfs) at

www.nature.com/nature

15

SUPPLEMENTARY INFORMATION

doi: 10.1038/nature07538

5 characteristic times 1) the initial cue starting the trial; 2) the confederate suggestion; 3) the decision time; 4) the outcome. 5) The ITI (not shown). In each subject we then fit the BOLD effect of prediction (blue line in top panel) with these 5 hrfs using a general linear model. A prediction error signal should show a significant negative effect of the 4th hrf (after the outcome was revealed). This was true for the social prediction error signals (t(22)=2.68 (p<0.005), 2.35 (p<0.05), 3.27 (p<0.005) for DMPFC, right MTG and right STS/TPJ respectively), and for the ventral striatal signal for reward prediction error (t(22)=2.50, p<0.05). The social regions showed a significant positive effect of lie prediction during the 3rd hrf (t(22)=1.96 (p<0.05), 1.73(p<0.05), 1.74(p<0.05) for DMPFC, right MTG and right STS/TPJ respectively). The ventral striatum showed a significant positive effect of reward prediction during the 2nd hrf (t(22)=3.32 (p<0.002)). In order to verify the model fit of the 5 hrf model, we plotted the predicted timecourse of effect sizes from these hrfs on top of the observed timecourses (green line in top panels).
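To make the deconvolution step concrete, the following is a minimal sketch (our illustration, not the authors' code) of fitting a resampled within-trial effect-size timecourse with five hrfs by ordinary least squares. The double-gamma hrf shape, the event onsets and all variable names are assumptions made for illustration only.

import numpy as np
from scipy.stats import gamma

dt = 0.1                                  # resampling resolution, 100 ms
t = np.arange(0, 25, dt)                  # resampled 25-s trial

def hrf(time, peak=6.0, undershoot=16.0, ratio=1.0/6.0):
    # Canonical double-gamma hemodynamic response function (assumed shape).
    h = gamma.pdf(time, peak) - ratio * gamma.pdf(time, undershoot)
    return h / h.max()

# One hrf regressor per characteristic event: cue, suggestion, decision, outcome, ITI.
event_onsets = [0.0, 5.0, 12.0, 17.0, 21.0]      # seconds (ITI onset is an assumption)
X = np.zeros((t.size, len(event_onsets)))
for j, onset in enumerate(event_onsets):
    late = t >= onset
    X[late, j] = hrf(t[late] - onset)

def fit_five_hrfs(prediction_effect_timecourse):
    # OLS fit of the prediction effect-size timecourse (blue line) onto the five hrfs.
    # A prediction error signal should give a significantly negative beta for the
    # 4th (outcome) hrf; X @ betas is the fitted timecourse (green line).
    betas, *_ = np.linalg.lstsq(X, prediction_effect_timecourse, rcond=None)
    return betas, X @ betas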

Reward prediction errors. As described above, we have shown that the reward prediction error signal in the ventral striatum can be decomposed into a positive effect of the reward and a negative effect of the prior expectation of reward, making the case for it being a reward prediction error (reward minus expectation). Previous studies that have found ventral striatal reward prediction error signals in FMRI data have done so by fitting the FMRI signal with the whole prediction error signal, without dividing it into its constituent parts13, 14. This strategy allows for the possibility that only one aspect of the reward prediction error (for example the rewarding outcome) might drive the correlation with the FMRI data. Whilst we have confirmed that this is not the case in our data with respect to the rewarding outcome and its expectation (main figure 2d), the reward prediction error itself is closely correlated with a third regressor. This third regressor simply signals the event of a rewarding outcome, independent of the magnitude or expectation of that outcome. To be sure that the signal in the ventral striatum represents a reward prediction error, we must therefore subject the data to a further, more stringent, test: we must allow the three potential regressors (reward magnitude, expectation of reward magnitude, and rewarding outcome) to compete for variance in the FMRI data. When we perform this more stringent test, it is no longer clear that there is an effect of reward prediction error in the ventral striatum. Instead, it appears that the majority of the variance in the FMRI signal can be accounted for by the simple effect of rewarding outcomes (grey line in supplementary figure 3). Thus, we find a significant negative effect of expected value in the ventral striatum when outcome is not included as a coregressor (t(22)=2.50, p<0.05), but this effect disappears when outcome is included (t(22)=0.75, p=0.23). To our knowledge, no previous study has subjected the ventral striatal signal to this most stringent test of reward prediction error coding. Nevertheless, this analysis does not preclude the possibility that reward prediction error is encoded in the ventral striatum; it instead means that we cannot conclusively confirm its presence using our current BOLD FMRI data. Intriguingly, despite the interest in the possibility that the ventral striatal BOLD signal reflects a reward prediction error, it has recently been shown that this signal is especially well correlated with the outcome rather than with the outcome prediction error15.
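The following is a minimal sketch (our illustration, not the authors' analysis code) of this competition analysis at a single post-outcome timepoint: the across-trial ROI signal is regressed simultaneously on reward magnitude, expected value and the binary rewarding-outcome regressor. Variable names are assumptions.

import numpy as np

def zscore(v):
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def competing_regressor_betas(roi_signal, reward_magnitude, expected_value, outcome_event):
    # roi_signal: one value per trial (ROI effect at a post-outcome timepoint).
    # outcome_event: 1 if a reward was delivered on that trial, 0 otherwise.
    X = np.column_stack([
        np.ones(len(roi_signal)),          # intercept
        zscore(reward_magnitude),          # actual reward on the trial
        zscore(expected_value),            # model-derived expectation of reward
        zscore(outcome_event),             # simple event of a rewarding outcome
    ])
    betas, *_ = np.linalg.lstsq(X, zscore(roi_signal), rcond=None)
    # A genuine prediction error signal would still require a negative expected-value
    # beta (betas[2]) once the outcome regressor is allowed to compete for variance.
    return betas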


[Supplementary Figure 3, panels a and b: effect-size timecourses across the trial (trial presented at 0 s, suggestion at 5 s, response at 10 s, outcome at 15 s; x axis, time (s), 0-25 s). Panel a plots the effects of expected value and reward magnitude; panel b additionally plots the effect of outcome.]

Figure S3: Decomposition of the ventral striatal signal into the constituent parts of the reward prediction error. Signal is taken from the ROI shown in main figure 2b. a) Reward and expected value were included as regressors in the analysis. After the outcome there was a significant positive effect of reward (red line ± s.e.m.) and a significant negative effect of expected value (blue line ± s.e.m.), demonstrating apparent prediction error coding (reward – expectation). b) However, when the trial outcome (a binary variable representing the presence or absence of a reward) is included as a coregressor, this outcome signal (grey line ± s.e.m.) accounts for the majority of the variance in the BOLD data that was previously explained by the prediction error.

BOLD responses to volatility and expected value of the chosen action reflect the degree to which subjects weigh each source of information in their behaviour. Our previous logistic regression analysis demonstrated that subjects use all three information sources to guide behaviour. This logistic model was convenient as it allowed us to perform a statistical test. However, it does not respect the fact that two of the information sources reflect probabilities and the third reflects a magnitude. Here we construct a model that incorporates this knowledge to gain a more accurate estimate of how much weight is given to each information source. We then show that the weight given to reward information in subject behaviour predicts the BOLD signal change to reward volatility in the ACC sulcus (figure 3b). Similarly, the weight given to confederate information in subject behaviour predicts the BOLD signal change to confederate volatility in the ACC gyrus (figure 3c). Optimal behaviour is to compute separately the probability of the next outcome given the reward history (p_r) and the probability of the next outcome given the current confederate advice and the history of confederate truths (p_c). We have previously described a Bayesian model for optimally tracking these probabilities through the task1. Subjects should then combine these probabilities into an overall probability (p_o) according to Bayes' rule:

p_o = \frac{p_r p_c}{p_r p_c + (1 - p_r)(1 - p_c)}

In order to compute the overall value of the action (V_o), subjects should then multiply this overall probability by the reward magnitude available on the action (R), V_o = p_o R, and select the action with the highest overall value.
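As a concrete illustration of this combination rule, here is a minimal sketch in Python (our code; the function names are not from the paper):

def combined_probability(p_r, p_c):
    # Bayes-rule combination of the reward-history and confederate-advice probabilities.
    return p_r * p_c / (p_r * p_c + (1.0 - p_r) * (1.0 - p_c))

def overall_value(p_r, p_c, reward_magnitude):
    # Overall value of an action: combined probability times the reward on offer.
    return combined_probability(p_r, p_c) * reward_magnitude

# Example: two sources that each mildly favour the same option (0.6 and 0.7) combine
# into a stronger belief, 0.42 / (0.42 + 0.12) = 0.78.
p_o = combined_probability(0.6, 0.7)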

To account for the fact that subject behaviour is guided to different extents by the different sources of information, we included a free parameter for each source of information, which allows the corresponding probability to be upweighted or downweighted with respect to the other source and with respect to reward magnitude.


We assume a sigmoidal form for this weighting, such that for each source of information:

p = \frac{1}{1 + \exp(-\gamma\,(p_{opt} - 0.5))}

where p is the probability used by the subject and p_{opt} is the probability computed by the optimal model. This equation transforms the optimal probabilities such that they are nearer to 0.5 if γ is small (and hence the source of information has less influence on behaviour), and nearer to 1 or 0 if γ is large (giving the source of information more influence on behaviour). Subjects are then assumed to generate actions stochastically, according to a further sigmoidal probability distribution (e.g. refs 11, 14):

P(C = \mathrm{green}) = \frac{1}{1 + \exp(-\beta\,(V_{\mathrm{green}} - V_{\mathrm{blue}}))}

We fit this model using Bayesian estimation (direct numerical integration) to estimate γ for reward information (γ_r) and confederate information (γ_c). These values are then included as subject-wise regressors in the FMRI group design (figure 3).
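Putting the pieces together, the following is a minimal sketch (our illustration, not the authors' code) of the behavioural model and of estimating γ_r and γ_c by direct numerical integration over a grid. The grid range, the flat prior and the fixed softmax temperature β are simplifying assumptions.

import numpy as np

def reweight(p_opt, gamma):
    # Sigmoidal reweighting: small gamma shrinks probabilities toward 0.5 (less influence),
    # large gamma pushes them toward 0 or 1 (more influence).
    return 1.0 / (1.0 + np.exp(-gamma * (p_opt - 0.5)))

def choice_loglik(gamma_r, gamma_c, p_r, p_c, R_green, R_blue, chose_green, beta=1.0):
    # Log-likelihood of the observed choices; p_r and p_c are the per-trial optimal
    # probabilities that green is correct, R_green/R_blue the reward magnitudes on offer.
    pr, pc = reweight(p_r, gamma_r), reweight(p_c, gamma_c)
    po = pr * pc / (pr * pc + (1.0 - pr) * (1.0 - pc))
    v_green, v_blue = po * R_green, (1.0 - po) * R_blue
    p_green = 1.0 / (1.0 + np.exp(-beta * (v_green - v_blue)))
    return np.sum(np.log(np.where(chose_green, p_green, 1.0 - p_green)))

def posterior_mean_gammas(p_r, p_c, R_green, R_blue, chose_green,
                          grid=np.linspace(0.1, 20.0, 60)):
    # Posterior means of gamma_r and gamma_c on a 2-D grid under a flat prior,
    # i.e. Bayesian estimation by direct numerical integration.
    loglik = np.array([[choice_loglik(gr, gc, p_r, p_c, R_green, R_blue, chose_green)
                        for gc in grid] for gr in grid])
    post = np.exp(loglik - loglik.max())
    post /= post.sum()
    return float((post.sum(axis=1) * grid).sum()), float((post.sum(axis=0) * grid).sum())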

ROI correlations (main figures 3 and 4). In the region of interest cross-subject correlation analyses (main figures 3b,c and 4b,c), we perform partial correlations to control for cross-subject correlations between the outcome and confederate weightings (which arise because some subjects are particularly strongly or weakly driven by reward magnitude). This analysis exactly mimics a GLM analysis with both weightings included as regressors (as in figure 3a). Hence, in figure 3b, the signal change in ACCs is plotted against the outcome weighting for each subject, with the effect of the confederate weighting removed from both signals. In figure 3c, the signal change in ACCg is plotted against the confederate weighting for each subject, with the effect of the outcome weighting removed from both signals. Similarly, in figure 4b, the effect of history-based probability in the VMPFC during the decision is plotted against the effect of history-based volatility in the ACCs during feedback, with the effect of confederate-based volatility in the ACCg during feedback removed from both signals, and the converse is true in figure 4c.
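A minimal sketch (our illustration; the function and variable names are ours) of this cross-subject partial correlation, which regresses the nuisance weighting out of both variables and then correlates the residuals:

import numpy as np
from scipy import stats

def residualise(y, nuisance):
    # Residuals of y after regressing out an intercept and the nuisance variable.
    X = np.column_stack([np.ones(len(y)), np.asarray(nuisance, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return np.asarray(y, dtype=float) - X @ beta

def partial_correlation(signal_change, weighting, nuisance_weighting):
    # e.g. figure 3b: ACCs signal change vs outcome weighting, controlling for the
    # confederate weighting in both variables (one value per subject in each array).
    return stats.pearsonr(residualise(signal_change, nuisance_weighting),
                          residualise(weighting, nuisance_weighting))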

ACC BOLD responses to volatility when observing the outcome of an action predict the degree to which the VMPFC BOLD response reflects each source of information. BOLD signal in the VMPFC predicts the subjective value of the action about to be made (main figure 4, supplementary figure 4). We have proposed that, when observing an outcome, ACC activity represents the extent to which the information available from this outcome will be used to influence future behaviour (main figure 3). We have shown that the ACCs response to outcome volatility and the ACCg response to confederate volatility at the time of the outcome reflect the extent to which these two sources of information will be used to guide behaviour. These hypotheses about the VMPFC and ACC suggest that activity in the two regions should be closely coupled. Activity in the sulcal and gyral subdivisions of the ACC when an outcome is witnessed should predict the extent to which VMPFC activity reflects the value of the upcoming action calculated on the basis of the outcome history and the confederate history, respectively. We found a significant correlation between the ACCs response to outcome volatility during the MONITOR phase (when the outcome was being witnessed) and the VMPFC response to the reward probability computed only on the basis of outcomes during the DECIDE (CUE+SUGGEST) phase (R = 0.7113, p<0.0002, main figure 4b). We found a significant correlation between the ACCg response to confederate volatility during the MONITOR phase and the VMPFC response to the reward probability computed only on the basis of the confederate history during the SUGGEST phase (R = 0.6119, p<0.002, main figure 4c).

Supplementary figure 4. Individual responses to the reward probabilities computed on the basis of each source of information individually. (a) Response to reward probability during CUE and SUGGEST computed only on the basis of past outcomes (thresholded at Z>2.6, p<0.05 cluster-corrected for VMPFC (cluster size >100 voxels)). No other clusters of >20 voxels were present elsewhere in the brain at this Z-threshold. (b) Response to social advice probability during SUGGEST computed only on the basis of present and past confederate advice (thresholded at Z>2.6, p<0.05 whole-brain cluster corrected). One other region (right posterior cingulate cortex (peak Z =4.56 at MNI 12mm, -20mm, 50mm), shown in right panel) also survived this cluster-based thresholding. (c) Overlap of (a) and (b).


Subject age ranges. Unlike many neuroimaging studies, which recruit subjects from a student population, we deliberately recruited subjects from the general public, to ensure that subjects did not come into the experiment with preconceptions about how it might be performed. This led to a somewhat wider age range in our sample than is normally used in imaging studies. One concern arising from this was that the wide age range, and the associated changes in hemodynamic responses and anatomical structure, might influence the observed results. We addressed this question in two ways. First, using formal statistical tests for outliers and violations of Gaussianity assumptions16, we confirmed that none of the older subjects were identified as outliers in the FMRI analyses. Second, we used region of interest analyses to test whether our results survived the removal of the older subjects from our population. Despite the inevitable loss of statistical power entailed by removing data from more than 20% of the participants, we were able to confirm our findings by excluding subjects over the age of 30 (n=5) and re-running our analyses on the remaining 18 subjects. All the results presented, except one (the three-way ANCOVA interaction of region × volatility type × history weight), remained significant in this re-analysis. The statistical tests from this re-analysis are presented in Table S1; the re-analysis step is also sketched after the table.

Table S1

Test | Presented in | Value including subjects aged >30 | Value excluding subjects aged >30
Social prediction error – negative effect of prediction during outcome (4th h.r.f.) | Fig. 2b (lower panel) and supplementary information | t(22)=2.68 (p<0.005) (DMPFC); 2.35 (p<0.05) (MTG); 3.27 (p<0.005) (STS/TPJ) | t(17)=3.25 (p<0.003) (DMPFC); 2.95 (p<0.005) (MTG); 1.91 (p<0.05) (STS/TPJ)
Social prediction error – positive effect of prediction following presentation of advice (3rd h.r.f.) | Fig. 2b (lower panel) and supplementary information | t(22)=1.96 (p<0.05) (DMPFC); 1.73 (p<0.05) (MTG); 1.74 (p<0.05) (STS/TPJ) | t(17)=2.95 (p<0.005) (DMPFC); 3.96 (p<0.001) (MTG); 1.95 (p<0.05) (STS/TPJ)
Reward prediction error – negative effect of prediction during outcome (4th h.r.f.) | Fig. 2d (lower panel) | t(22)=2.50 (p<0.05) | t(17)=2.40 (p<0.05)
Reward prediction error – positive effect of prediction following presentation of decision (2nd h.r.f.) | Fig. 2d (lower panel) | t(22)=3.32 (p<0.002) | t(17)=2.54 (p<0.02)
Correlation of outcome weighting with outcome volatility effect size in ACC sulcus | Fig. 3b | R=0.7163 (p<0.0001) | R=0.5244 (p<0.05)
Correlation of collaborator weighting with collaborator volatility effect size in ACC gyrus | Fig. 3c | R=0.7252 (p<0.0001) | R=0.6879 (p<0.002)
Region × reward history weight interaction | Supplementary information | F(1,19)=4.747 (p<0.05) | F(1,15)=7.623 (p<0.02)
Region × confederate weight interaction | Supplementary information | F(1,19)=8.770 (p<0.01) | F(1,15)=12.90 (p<0.005)
Region × volatility type × history weight interaction | Main text | F(1,19)=5.379 (p<0.03) | F(1,15)=0.843 (p=0.373)**
Region × volatility type × collaborator weight interaction | Main text | F(1,19)=7.145 (p<0.02) | F(1,15)=6.285 (p<0.05)
Correlation of outcome volatility effect size in ACC sulcus with outcome-based chosen value in vmPFC | Fig. 4b | R=0.7113 (p<0.0002) | R=0.4244 (p=0.079)*
Correlation of collaborator volatility effect size in ACC gyrus with collaborator-based chosen value in vmPFC | Fig. 4c | R=0.6119 (p<0.002) | R=0.5824 (p<0.02)

* approaching significance; ** no longer significant
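The exclusion re-analysis referred to above is sketched here (our illustration, not the authors' code): per-subject ROI effect sizes are re-tested against zero after dropping subjects aged over 30. The array names are hypothetical.

import numpy as np
from scipy import stats

def rerun_without_older_subjects(effect_sizes, ages, age_cutoff=30):
    # One-sample t-test on per-subject ROI effect sizes, excluding subjects older than
    # the cutoff. scipy returns a two-tailed p; the tests reported above are directional,
    # so the corresponding one-tailed p is half of this value.
    keep = np.asarray(ages) <= age_cutoff
    t, p = stats.ttest_1samp(np.asarray(effect_sizes, dtype=float)[keep], popmean=0.0)
    return t, p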

1. Behrens, T. E., Woolrich, M. W., Walton, M. E. & Rushworth, M. F. Learning the value of information in an uncertain world. Nat Neurosci 10, 1214-21 (2007).
2. Deichmann, R., Gottfried, J. A., Hutton, C. & Turner, R. Optimized EPI for fMRI studies of the orbitofrontal cortex. Neuroimage 19, 430-41 (2003).
3. Smith, S. M. et al. Advances in functional and structural MR image analysis and implementation as FSL. Neuroimage 23 Suppl 1, S208-19 (2004).
4. Jenkinson, M., Bannister, P., Brady, M. & Smith, S. Improved optimization for the robust and accurate linear registration and motion correction of brain images. Neuroimage 17, 825-41 (2002).
5. Smith, S. M. Fast robust automated brain extraction. Hum Brain Mapp 17, 143-55 (2002).
6. Jenkinson, M. Fast, automated, N-dimensional phase-unwrapping algorithm. Magn Reson Med 49, 193-7 (2003).
7. Woolrich, M. W., Ripley, B. D., Brady, M. & Smith, S. M. Temporal autocorrelation in univariate linear modeling of FMRI data. Neuroimage 14, 1370-86 (2001).
8. Jenkinson, M. & Smith, S. A global optimisation method for robust affine registration of brain images. Med Image Anal 5, 143-56 (2001).
9. Woolrich, M. W., Behrens, T. E., Beckmann, C. F., Jenkinson, M. & Smith, S. M. Multilevel linear modelling for FMRI group analysis using Bayesian inference. Neuroimage 21, 1732-47 (2004).
10. Kable, J. W. & Glimcher, P. W. The neural correlates of subjective value during intertemporal choice. Nat Neurosci 10, 1625-33 (2007).
11. Hampton, A. N., Bossaerts, P. & O'Doherty, J. P. The role of the ventromedial prefrontal cortex in abstract state-based inference during decision making in humans. J Neurosci 26, 8360-7 (2006).
12. Daw, N. D., O'Doherty, J. P., Dayan, P., Seymour, B. & Dolan, R. J. Cortical substrates for exploratory decisions in humans. Nature 441, 876-9 (2006).
13. Haruno, M. & Kawato, M. Different neural correlates of reward expectation and reward expectation error in the putamen and caudate nucleus during stimulus-action-reward association learning. J Neurophysiol 95, 948-59 (2006).
14. O'Doherty, J. et al. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science 304, 452-4 (2004).
15. D'Ardenne, K., McClure, S. M., Nystrom, L. E. & Cohen, J. D. BOLD responses reflecting dopaminergic signals in the human ventral tegmental area. Science 319, 1264-7 (2008).
16. Woolrich, M. Robust group analysis using outlier inference. Neuroimage 41, 286-301 (2008).
