22/04/2016

HOW TO COMPUTE POWER FOR TEST-RETEST DESIGNS? APPLICATION TO FMRI SPATIAL MAPS IN THE CONTEXT OF PRE-SURGICAL MAPPING

CYRIL PERNET, PHD
NEUROIMAGING SCIENCES, CENTRE FOR CLINICAL BRAIN SCIENCES
SCHOOL OF CLINICAL SCIENCES, THE UNIVERSITY OF EDINBURGH.

CONTEXT

• BRAIN TUMOUR TREATMENT MOST OFTEN CONSISTS OF SURGICAL REMOVAL OF THE TUMOUR
• FMRI IS USED TO MAP ELOQUENT CORTICAL AREAS, I.E. WE WANT TO FIND AREAS NEAR THE TUMOUR THAT ARE ACTIVE IN MAJOR BEHAVIOURAL TASKS (WALKING, TALKING, UNDERSTANDING ...) SO AS TO AVOID DAMAGE DURING SURGERY

TWO PROBLEMS:
• MULTIPLE TASKS ARE LIKELY TO ACTIVATE THE SAME AREA (TASK FMRI BRAIN ACTIVATION HAS LOW SPECIFICITY WITH REGARD TO BEHAVIOUR). WE NEED TO DEFINE PROTOCOLS THAT ACTIVATE THE AREAS OF INTEREST WITH HIGH SENSITIVITY/SPECIFICITY; SPECIFICITY OF BEHAVIOUR IS NOT OF INTEREST (THE TASK SIMPLY NEEDS TO BE EASY FOR PATIENTS)
• WE ARE DEALING WITH MULTIVARIATE TIME SERIES (SEE TOM’S TALK) AND STATISTICAL MAPS RATHER THAN STATIC DIRECT MEASUREMENTS

MEASURING RELIABILITY

• GOAL IS TO USE RELIABILITY TO DEFINE GOOD PROTOCOLS THAT CAN BE USED IN THE CONTEXT OF SURGICAL MAPPING (RELIABLE PROTOCOL AT THE SINGLE SUBJECT LEVEL = PROTOCOL WITH HIGH SENSITIVITY/SPECIFICITY TO MAP AREAS OF INTEREST)
• ASSUMES GOOD CO-REGISTRATION AND TIME-SERIES MOTION CORRECTION, AND ALSO ASSUMES DATA ARE CORRECTLY MODELLED (NOT DISCUSSED HERE)
• TIME SERIES CORRELATIONS: FOR EACH VOXEL, SEE HOW TIME SERIES CORRELATE BETWEEN SESSIONS
• RELIABILITY OF SPMS (ICC AND T VALUE DIFFERENCE)
• RELIABILITY OF THRESHOLDED SPMS (DICE COEF = OVERLAP OF BINARY MAPS)

TIME SERIES CORRELATIONS

• EXAMINE BETWEEN SESSION VARIANCE
• LOWER BOUND ON RELIABILITY BEFORE MODEL FITTING
• DOESN’T DEFINE IF A PROTOCOL IS GOOD OR NOT, BUT CAN BE USED TO SHOW THAT ONE PROTOCOL LEADS TO SMALLER VARIATIONS THAN ANOTHER
• SHOWS REGIONAL DIFFERENCES: REGIONS THAT RESPOND TO THE TASK SHOW HIGHER CORRELATIONS
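
The voxel-wise time series correlation between sessions can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in the studies cited in these slides. The function names `pearson` and `voxelwise_correlation` and the toy data are hypothetical, and the data are assumed to be already co-registered so that voxel i is the same location in both sessions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # flat time series: correlation is undefined
    return sxy / sqrt(sxx * syy)


def voxelwise_correlation(session1, session2):
    """Correlate each voxel's time series in session 1 with the same voxel in session 2.

    Both inputs are lists of voxel time series (voxels x time points),
    assumed to be in the same voxel order (hypothetical layout).
    """
    return [pearson(ts1, ts2) for ts1, ts2 in zip(session1, session2)]


# Toy data (made up): two voxels, four time points per session.
session1 = [[1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]]
session2 = [[2.0, 4.0, 6.0, 8.0], [4.0, 2.0, 3.0, 1.0]]
r_map = voxelwise_correlation(session1, session2)
```

Averaging `r_map` over the whole brain and over a region of interest gives the two values whose difference indicates regional specificity.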

RELIABILITY OF UNTHRESHOLDED STAT MAPS

• INTRACLASS CORRELATION COEFFICIENT
• PROCESS FMRI DATA AND OBTAIN T MAPS FOR EACH SUBJECT AND SESSION
• ICC = VAR(SUBJECTS) / (VAR(SUBJECTS) + VAR(SESSION))
• STRONG INFLUENCE OF SUBJECT VARIANCE: ASSUMING THE SAME VAR(SESSION), STUDIES WITH LARGE VAR(SUBJECTS) BECOME MORE RELIABLE!
• USELESS IN THE CONTEXT OF VALIDATING PROTOCOLS
• MEAN SQUARE OF T VALUE DIFFERENCES: Σ(T1 - T2)² / N
• SAME AS VAR(SESSION) IN ICC WITHOUT LEARNING EFFECT
• USEFUL BECAUSE IT ESTIMATES ONLY THE SESSION EFFECT, AND N CAN BE SUBJECTS OR VOXELS!

RELIABILITY OF THRESHOLDED STAT MAPS

• PERCENTAGE OVERLAP OF BINARIZED MAPS
• CLUSTER OVERLAP MEASURES DECREASE WITH INCREASING THRESHOLD
• OVERLAP SCORES ARE DEPENDENT ON THE VOLUME OF ACTIVATION
• WHEN DIFFERENT THRESHOLDS ARE USED FOR THE WHOLE BRAIN, DIFFERENT ACTIVATION MAPS CAN BE OBTAINED, BUT SIMILAR MEASURES OF OVERLAP CAN BE OBSERVED
• SENSITIVE TO BORDERLINE CASES: TWO VERY SIMILAR T MAPS, ONE SLIGHTLY ABOVE A THRESHOLD AND ANOTHER SLIGHTLY BELOW, WOULD GIVE A FALSE IMPRESSION OF HIGH VARIABILITY
• YET, FOR SURGERY WE SOMETIMES NEED TO KNOW IF THE TUMOUR HAS INFILTRATED OR PUSHED THE ‘ACTIVE’ TISSUE NEARBY – THRESHOLDED MAPS ARE WHAT IS NEEDED
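
The two measures for unthresholded maps, the ICC and the mean square of T value differences, can be illustrated for a single voxel as below. This is a sketch under stated assumptions: the variance components are estimated with a one-way random-effects ANOVA, ICC(1,1), for two sessions, which may differ from the ICC variant used in the cited work. The function names and the T values are hypothetical.

```python
def icc_oneway(t1, t2):
    """ICC = var(subjects) / (var(subjects) + var(session)) for two sessions.

    Assumption: variance components come from a one-way ANOVA (ICC(1,1)),
    so the within-subject mean square stands in for var(session).
    """
    n, k = len(t1), 2
    subj_means = [(a + b) / 2 for a, b in zip(t1, t2)]
    grand_mean = sum(subj_means) / n
    # Between-subject and within-subject mean squares
    msb = k * sum((m - grand_mean) ** 2 for m in subj_means) / (n - 1)
    msw = sum((a - m) ** 2 + (b - m) ** 2
              for a, b, m in zip(t1, t2, subj_means)) / (n * (k - 1))
    var_subjects = (msb - msw) / k  # can be negative, giving a negative ICC
    var_session = msw
    denom = var_subjects + var_session
    return var_subjects / denom if denom != 0 else float("nan")


def mean_square_difference(t1, t2):
    """Mean square of T value differences: sum((T1 - T2)^2) / N."""
    return sum((a - b) ** 2 for a, b in zip(t1, t2)) / len(t1)


# Made-up T values at one voxel for four subjects, two sessions each.
# Both data sets have the same session differences (+/- 0.5) but a
# different spread across subjects.
wide_s1, wide_s2 = [2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5]
narrow_s1, narrow_s2 = [4.0, 4.5, 5.0, 5.5], [4.5, 4.0, 5.5, 5.0]

icc_wide = icc_oneway(wide_s1, wide_s2)
icc_narrow = icc_oneway(narrow_s1, narrow_s2)
msd_wide = mean_square_difference(wide_s1, wide_s2)
msd_narrow = mean_square_difference(narrow_s1, narrow_s2)
```

With these toy values the mean square difference is identical for both data sets, while the ICC is higher for the one with the larger subject variance. Under this one-way estimate with two sessions, the within-subject mean square equals half the mean square difference, so the two quantities carry the same session information up to a constant factor.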


APPLICATION TO 5 TASKS

5 TASKS

• MOTOR: MOVE HAND, FOOT, LIPS
• LANGUAGE: OVERT AND COVERT VERB GENERATION (SEE A WORD AND FIND THE CORRESPONDING VERB) AND OVERT WORD REPETITION (HEAR A WORD AND REPEAT IT)
• SPATIAL ATTENTION: LANDMARK TASK (TELL IF A LINE IS CROSSED IN ITS MIDDLE)
• TASKS DESIGNED TO ELICIT ACTIVATIONS IN AREAS KNOWN TO CAUSE MAJOR DEBILITATING DEFICITS IF DAMAGED (HEMIPLEGIA, APHASIA, HEMINEGLECT)

RESULTS (1)

ADAPTIVE VS. FIXED STATS THRESHOLDS

Because the SNR changes all the time, even within a given subject, changing the statistical threshold between sessions is necessary to achieve high Dice reliability. A major cause of false negatives is ‘globals’, which lead to an overall shift of the ‘activation’ (T-test) distribution.

Difference between full brain and ROI for correlations and Dice (regional specificity).

Gorgolewski et al. NeuroImage (2013). Single subject fMRI test–retest reliability metrics and confounding factors

QUESTION (1)

• WE WANT TO TEST A NEW SET OF TASKS
• PRIOR DATA TELL US WHERE WE SHOULD FIND ACTIVATIONS – WE MAY EVEN HAVE ACCESS TO RAW DATA AND/OR T MAPS FOR SOME PROTOCOLS
• HOW MANY SUBJECTS DO I NEED TO HAVE REGIONAL SPECIFICITY?
• T-TEST OF THE WHOLE BRAIN BETWEEN-SESSIONS MEAN CORRELATION VALUE VS. THE REGION OF INTEREST CORRELATION VALUE
• IF ONLY SINGLE SESSION DATA ARE AVAILABLE, CAN WE INFER SOMETHING?
• HOW MANY SUBJECTS DO I NEED TO HAVE A GOOD ESTIMATE OF THE PERCENTAGE OF OVERLAP?
• NEED TO COMPARE PERCENTAGE OVERLAP BETWEEN SESSIONS VS. BETWEEN SUBJECTS

RELIABILITY OF PERCENTAGE OVERLAP

• NORMALIZE DATA IN SPACE INTO A COMMON TEMPLATE
• BINARIZE DATA USING ADAPTIVE THRESHOLD
• COMPUTE WITHIN SUBJECT OVERLAP AND MEAN BETWEEN SUBJECT OVERLAP
• COMPARE MEDIAN VALUES USING BOOTSTRAP

Pernet et al. (2016) Evaluation of a presurgical functional MRI workflow. Int J Med Inform 86
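
The within-subject and mean between-subject overlap of binarized maps can be sketched as follows, using the Dice coefficient 2|A∩B| / (|A| + |B|). This is an illustration only: the maps are assumed to be already normalized to a common template and thresholded, the between-subject overlap is computed here by pairing maps from the same session (one possible choice, not necessarily the one used in the cited work), and all names and data are hypothetical.

```python
def dice(a, b):
    """Dice overlap of two binary maps given as flat sequences of 0/1."""
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    return 2 * intersection / total if total else float("nan")


def within_between_overlap(maps):
    """maps[i] = (session1_map, session2_map) for subject i.

    Returns, per subject, the within-subject overlap (session 1 vs session 2)
    and the mean overlap with every other subject (same-session pairs).
    """
    within = [dice(s1, s2) for s1, s2 in maps]
    between = []
    for i, own in enumerate(maps):
        scores = [dice(own[s], other[s])
                  for j, other in enumerate(maps) if j != i
                  for s in (0, 1)]
        between.append(sum(scores) / len(scores))
    return within, between


# Made-up binary maps: three subjects, six voxels, two sessions each.
maps = [
    ([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0]),
    ([0, 0, 1, 1, 1, 0], [0, 0, 1, 1, 0, 0]),
    ([1, 0, 0, 0, 1, 1], [1, 0, 0, 0, 0, 1]),
]
within, between = within_between_overlap(maps)
```

A protocol is of interest for single-subject mapping when `within` is clearly higher than `between`, which is the comparison reported task by task in the results.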


MOTOR TASK IS GOOD

• HIGHER RELIABILITY WITHIN THAN BETWEEN SUBJECTS FOR ALL CONTRASTS:
  • FINGER: 0.6 VS. 0.34, DIFFERENCE [0.13 0.32], P = 0.004
  • FOOT: 0.53 VS. 0.32, DIFFERENCE [0.08 0.29], P = 0.0006
  • LIPS: 0.46 VS. 0.26, DIFFERENCE [0.06 0.3], P = 0.004

Maximum T values of the RFX analyses matched maximum ICC values and maximum single subject maps: consistent signal increase across subjects (average over sessions, similar amplitudes in both sessions and strong enough to be seen in all subjects)
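
Comparing median within-subject and between-subject overlap with a bootstrap, as reported in these results, could look like the sketch below. The slides do not specify the bootstrap scheme, so this assumes a percentile bootstrap that resamples subjects with replacement. The function name and the overlap values are hypothetical and are not the study data.

```python
import random
from statistics import median


def bootstrap_median_difference(within, between, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for median(within) - median(between).

    within[i] and between[i] are the overlap scores of subject i;
    subjects are resampled with replacement (assumed scheme).
    """
    rng = random.Random(seed)
    n = len(within)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(median(within[i] for i in idx)
                     - median(between[i] for i in idx))
    diffs.sort()
    lower = diffs[int(n_boot * alpha / 2)]
    upper = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    observed = median(within) - median(between)
    return observed, (lower, upper)


# Made-up Dice values for ten subjects (illustration only).
within = [0.62, 0.55, 0.70, 0.48, 0.66, 0.59, 0.61, 0.52, 0.64, 0.58]
between = [0.35, 0.30, 0.38, 0.29, 0.36, 0.33, 0.34, 0.31, 0.37, 0.32]
observed, (ci_low, ci_high) = bootstrap_median_difference(within, between)
```

A confidence interval that excludes zero, as for the motor contrasts above, indicates higher reliability within than between subjects.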

OVERT WORD REPETITION IS GOOD

• HIGHER RELIABILITY WITHIN (0.45) THAN BETWEEN (0.22) SUBJECTS: DIFFERENCE [0.13 0.32] P = 0

Maximum T values of the RFX analyses matched maximum ICC values over Wernicke's area and maximum single subject maps: consistent signal increase across subjects (average over sessions, similar amplitudes in both sessions and strong enough to be seen in all subjects – note the difference of peaks over the right homologue).

OVERT VERB GENERATION IS NOT SO GOOD

• HIGHER RELIABILITY WITHIN (0.32) THAN BETWEEN (0.06) SUBJECTS: DIFFERENCE [0.03 0.44] P = 0.02

RFX analyses, ICC values (even negative) and stacked overlaps are very different: there is a signal increase across subjects (average over sessions), but amplitudes can change substantially (ICC), such that little consistency across sessions can be observed.

COVERT VERB GENERATION IS GOOD

• HIGHER RELIABILITY WITHIN (0.55) THAN BETWEEN (0.24) SUBJECTS: DIFFERENCE [0.06 0.41] P = 0

Maximum T values of the RFX analyses are slightly different from maximum ICC values or stacked overlaps: still a quite consistent signal increase across subjects (average over sessions, similar amplitudes in both sessions and strong enough to be seen in all subjects – note the absence of right frontal activation at the group level).

LANDMARK TASK IS BAD

• NOT MORE RELIABLE WITHIN (0.12) THAN BETWEEN (0.09) SUBJECTS: DIFFERENCE [-0.01 0.19] P = 0.74

RFX analyses, ICC values (even negative) and stacked overlaps are very different: there is a signal increase across subjects (average over sessions), but amplitudes can change substantially (ICC), such that little consistency across sessions can be observed.


QUESTION (2)

• I CAN JUST COMPUTE THE POWER FOR A T-TEST ON DICE WITHIN VS. BETWEEN SUBJECTS
• AS BEFORE, IF WE HAVE ACCESS TO A SINGLE SESSION ONLY, IS THERE A WAY TO ESTIMATE SOMETHING SIMILAR BASED ON BETWEEN SUBJECT OVERLAP ONLY?
• GIVEN THAT OVERLAP RELIABILITY IS HIGH FOR A RANGE OF VALUES, HOW BEST TO ESTIMATE POWER? (EVEN ASSUMING TWO SESSIONS ARE AVAILABLE)
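
The first option, power for a t-test on within vs. between subject Dice, can be approximated as below. This is a rough sketch, not a recommended procedure: it treats the comparison as a paired test on per-subject differences and uses a normal approximation to the t distribution, which slightly underestimates the required sample size for small N. The function names are hypothetical, and the effect size (mean difference divided by the standard deviation of the differences) is an assumed input, since the slides report means and confidence intervals but no standard deviations.

```python
from math import ceil, sqrt
from statistics import NormalDist


def paired_t_power(effect_size, n, alpha=0.05):
    """Approximate two-sided power of a paired test (normal approximation).

    effect_size = mean(within - between) / sd(within - between).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def subjects_needed(effect_size, power=0.8, alpha=0.05):
    """Approximate number of subjects needed to reach the requested power."""
    z = NormalDist()
    n = ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / effect_size) ** 2
    return ceil(n)


# Hypothetical standardized effect size of 0.5 (not taken from the study).
n_required = subjects_needed(0.5)
achieved = paired_t_power(0.5, n_required)
```

This does not address the other two questions on the slide: the single-session case and the fact that overlap values depend on the threshold both call for a different approach, for example simulation from prior data.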
