Alden Gross
16 Aug 2012

Which ICC should we use for our functional composites study?

Analysis goals

Here, I use data provided in Shrout and Fleiss (1979) on ratings for 6 targets from 4 judges. (1) I calculate ICCs using their formulae, along the way testing what icr11.ado does. (2) I then apply the equations to a study we conducted in which 10 expert clinicians were asked to rate cognitive, physical, and independent loadings (3 sets of ratings) for each of 25 IADL/ADL items.

Summary of results

Use ICC(2,k) to describe agreement of the mean judge rating for items. Use ICC(2,1) to describe agreement of judges on a particular individual item's rating.

Background

We report that we are using an ICC(1,1) to describe the reliability of a single rater. I will show later that I think icr11.ado is actually calculating an ICC(3,1). This mistake is easy to make: the two formulas have identical forms.

ICC(1,1) describes reliability in a study in which each item (or subject) is rated by a unique set of judges (or raters); that is, not all judges necessarily rate each item. This is an aspect of study design, and ICC(1,1) can be ruled out quickly if your study did not use this design.

ICC(2,1) assumes that a random set of judges, drawn from a population of judges, has rated all of the items; for example, the same judges rate cognitive load for every item in a functional battery. This is our situation in SAGES.

ICC(3,1) is similar, but the judges are fixed effects because we have sampled all possible judges (example: all 50 states vote on a constitutional amendment; there are 50 states, and we have no need to generalize to a 51st state, so the judges, the states, are fixed effects).

Depending on one's explicit purpose, ICC(2,1) and ICC(3,1) can be calculated together: the former is better described as a measure of agreement, while the latter measures consistency across judges. ICC(3,1) is usually larger because it does not account for the judges having been randomly selected from a population.

In addition to cases 1, 2, and 3, we can describe either the reliability of an individual rater, ICC(·,1), or of the mean rating of the judges, ICC(·,k). This is an interpretive issue, but I would think we want to describe the mean rating among a set of judges in SAGES, since we will use the composites, and not individual judge ratings, later on for Thurstone scaling. Thus, ICC(2,k).

Shrout and Fleiss (1979) provide a dense description of the ICC. They note (pp. 423-424), "It is not likely that ICC(2,1) or ICC(3,1) will ever be erroneously used in a case 1 study, since the appropriate mean squares would not be available. The misuse of ICC(1,1) on data from Case 2 or Case 3 studies is more likely. A consequence of this mistake is the underestimation of the true correlation..."
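For reference, here are the Shrout and Fleiss estimators, written in the mean-square notation used in the calculations below (BMS = between-targets mean square, JMS = between-judges mean square, WMS = within-target mean square, EMS = residual mean square; k judges rate each of n targets):

    ICC(1,1) = (BMS - WMS) / (BMS + (k-1)*WMS)
    ICC(2,1) = (BMS - EMS) / (BMS + (k-1)*EMS + k*(JMS - EMS)/n)
    ICC(3,1) = (BMS - EMS) / (BMS + (k-1)*EMS)
    ICC(1,k) = (BMS - WMS) / BMS
    ICC(2,k) = (BMS - EMS) / (BMS + (JMS - EMS)/n)
    ICC(3,k) = (BMS - EMS) / BMS

Note how ICC(1,·) and ICC(3,·) share the same algebraic form, differing only in whether WMS or EMS is plugged in; this is the confusion I suspect in icr11.ado.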
Here are ICCs based on data provided in Table 2 of Shrout and Fleiss (1979) on ratings for 6 targets from 4 judges. The correct ICCs are provided in Table 4 of their paper. The calculations agree (we also validated the equations using data from Shrout's chapter in Psychiatric Epidemiology).

. webuse judges
(Ratings of targets by judges)

. anova rating judge target
                           Number of obs =      24     R-squared     =  0.9095
                           Root MSE      = 1.00968     Adj R-squared =  0.8612

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  153.666667     8   19.2083333     18.84     0.0000
                         |
                   judge |  97.4583333     3   32.4861111     31.87     0.0000
                  target |  56.2083333     5   11.2416667     11.03     0.0001
                         |
                Residual |  15.2916667    15   1.01944444
              -----------+----------------------------------------------------
                   Total |  168.958333    23   7.34601449
. * note: JMS was tough.
. * From SF1979, WMS = EMS + (JMS-EMS)/n.
. * Check out Table 1, middle column, do algebra.
. local bms = (`e(ss_2)' / `e(df_2)')

. local ems = (`e(rss)' / `e(df_r)')

. local jms = (`e(ss_1)' / `e(df_1)')

. local wms = (`e(rss)' / `e(df_r)') ///  6.26 in SF1979
>          + (`e(ss_1)' / `e(df_1)' ///
>          - `e(rss)' / `e(df_r)') / (`e(df_2)'+1)

. * ICC(1,1). Should be 0.17, per Table 4.
. display "ICC(1,1): " _c
ICC(1,1): . display (`bms' - `wms') ///
>        / (`bms' + `e(df_1)'*(`wms'))
.16574177

. * ICC(2,1). Should be 0.29, per Table 4.
. display "ICC(2,1): " _c
ICC(2,1): . display (`bms' - `ems') ///
>        / (`bms' ///
>        + `e(df_1)' * `ems' ///
>        + (`e(df_1)'+1)*(`jms' - `ems') / (`e(df_2)'+1))
.28976378

. * ICC(3,1). Should be 0.71, per Table 4.
. display "ICC(3,1): " _c
ICC(3,1): . display (`bms' - `ems') ///
>        / (`bms' ///
>        + `e(df_1)' * `ems')
.71484071

. * ICC(1,k). Should be 0.44, per Table 4.
. display "ICC(1,k): " _c
ICC(1,k): . display (`bms' - `wms') / `bms'
.44279713

. * ICC(2,k). Should be 0.62, per Table 4.
. display "ICC(2,k): " _c
ICC(2,k): . display (`bms' - `ems') ///
>        / (`bms' ///
>        + (`jms' - `ems')/(`e(df_2)'+1))
.62005055

. * ICC(3,k). Should be 0.91, per Table 4.
. display "ICC(3,k): " _c
ICC(3,k): . display (`bms' - `ems') / (`bms')
.90931554
So, what is icr11.ado doing? It appears to be calculating ICC(3,1). This is usually larger than ICC(1,1), and likely larger than, but similar in magnitude to, ICC(2,1), since ICC(2,1) carries the additional uncertainty of randomly selected raters. The mistake is easy to make by mixing up JMS with EMS in the ANOVA, because the equations are otherwise the same.

. version 10

. icr11 , rating(rating) rater(judge) case(target) anova
(Using anova)

                           Number of obs =      24     R-squared     =  1.0000
                           Root MSE      =       0     Adj R-squared =       .

                  Source |  Partial SS    df       MS           F     Prob > F
            -------------+----------------------------------------------------
                   Model |  168.958333    23   7.34601449
                         |
                  target |  56.2083333     5   11.2416667
                   judge |  97.4583333     3   32.4861111
            target*judge |  15.2916667    15   1.01944444
                         |
                Residual |           0     0            .
            -------------+----------------------------------------------------
                   Total |  168.958333    23   7.34601449

ICR(1,1) = 0.715

The intraclass correlation for a single rater [ICR(1,1)] describes the reliability of a single randomly selected rater. The result can be interpreted as the percent of the variance of a single rater's ratings that is attributable to systematic differences between cases.
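To make the mix-up concrete, here is a minimal sketch (assuming the bms, ems, and wms locals computed above are still in memory): the same algebraic form yields 0.166 with WMS and 0.715 with EMS, and the latter is exactly what icr11 reports.

. * One-way form with WMS gives ICC(1,1)  (k-1 = 3, with 4 judges):
. display (`bms' - `wms') / (`bms' + 3*`wms')   //  .16574177
. * The identical form with EMS in place of WMS gives ICC(3,1),
. * which is icr11's 0.715:
. display (`bms' - `ems') / (`bms' + 3*`ems')   //  .71484071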
What do these ICCs look like for the SAGES functional composites study?
. use $derived/fxncomp-208-ratings.dta, clear

. quietly foreach t in 1 2 3 {
    ...
    }

Composite   ICC(1,1)    ICC(2,1)    ICC(3,1)    ICC(1,k)    ICC(2,k)    ICC(3,k)
type 1      .61593358   .62128812   .7219388    .94130478   .94254623   .96291256
type 2      .66668327   .66868278   .71135592   .95238434   .95279134   .96100565
type 3      .68435392   .68594985   .72247879   .95591034   .95622109   .96300856
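The loop body is suppressed by quietly. Here is a hypothetical reconstruction that reuses the Shrout and Fleiss formulas from earlier; the variable names (nu for the rating, plus raterid, item, and type) are assumptions borrowed from the alpha example below.

foreach t in 1 2 3 {
    quietly anova nu raterid item if type==`t'
    local bms = `e(ss_2)' / `e(df_2)'              // between-items MS
    local ems = `e(rss)'  / `e(df_r)'              // residual MS
    local jms = `e(ss_1)' / `e(df_1)'              // between-raters MS
    local wms = `ems' + (`jms' - `ems') / (`e(df_2)' + 1)
    display "Composite type `t'"
    display "ICC(1,1): " (`bms'-`wms') / (`bms' + `e(df_1)'*`wms')
    display "ICC(2,1): " (`bms'-`ems') / (`bms' + `e(df_1)'*`ems' ///
        + (`e(df_1)'+1)*(`jms'-`ems') / (`e(df_2)'+1))
    display "ICC(3,1): " (`bms'-`ems') / (`bms' + `e(df_1)'*`ems')
    display "ICC(1,k): " (`bms'-`wms') / `bms'
    display "ICC(2,k): " (`bms'-`ems') / (`bms' + (`jms'-`ems')/(`e(df_2)'+1))
    display "ICC(3,k): " (`bms'-`ems') / `bms'
}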
Bonus material: Cronbach's alpha is mathematically equivalent to the ICC for the mean of multiple observations with fixed raters/items, ICC(3,k).

. foreach type in 1 2 3 {
  2.     display "Composite type `type'"
  3.     preserve
  4.     keep if type==`type'
  5.     drop stub u lab name
  6.     reshape wide nu, i(item) j(raterid)
  7.     alpha nu*
  8.     restore
  9. }
Composite type 1
(500 observations deleted)
(note: j = 1 2 3 4 5 6 7 8 9 10)

Data                               long   ->   wide
---------------------------------------------------------------------------
Number of obs.                      250   ->      25
Number of variables                   4   ->      12
j variable (10 values)          raterid   ->   (dropped)
xij variables:
                                     nu   ->   nu1 nu2 ... nu10
---------------------------------------------------------------------------

Test scale = mean(unstandardized items)

Average interitem covariance:     .0679642
Number of items in the scale:           10
Scale reliability coefficient:      0.9629

Composite type 2
(500 observations deleted)
(note: j = 1 2 3 4 5 6 7 8 9 10)

Data                               long   ->   wide
---------------------------------------------------------------------------
Number of obs.                      250   ->      25
Number of variables                   4   ->      12
j variable (10 values)          raterid   ->   (dropped)
xij variables:
                                     nu   ->   nu1 nu2 ... nu10
---------------------------------------------------------------------------

Test scale = mean(unstandardized items)

Average interitem covariance:     .0748336
Number of items in the scale:           10
Scale reliability coefficient:      0.9610

Composite type 3
(500 observations deleted)
(note: j = 1 2 3 4 5 6 7 8 9 10)

Data                               long   ->   wide
---------------------------------------------------------------------------
Number of obs.                      250   ->      25
Number of variables                   4   ->      12
j variable (10 values)          raterid   ->   (dropped)
xij variables:
                                     nu   ->   nu1 nu2 ... nu10
---------------------------------------------------------------------------

Test scale = mean(unstandardized items)

Average interitem covariance:     .1143844
Number of items in the scale:           10
Scale reliability coefficient:      0.9628
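As expected, the scale reliability coefficients (0.9629, 0.9610, 0.9628) agree with the ICC(3,k) values reported above (.96291256, .96100565, .96300856) to three decimal places.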
There's an official Stata ado, icc.ado, that calculates all the ICCs from Shrout and Fleiss and has some other options to confuse you further:

. webuse judges
(Ratings of targets by judges)

. * ICC(1,1 and k)
. icc rating target

Intraclass correlations
One-way random-effects model
Absolute agreement

Random effects: target           Number of targets =         6
                                 Number of raters  =         4

---------------------------------------------------------------
                 rating |        ICC       [95% Conf. Interval]
------------------------+--------------------------------------
             Individual |   .1657418       -.1329323    .7225601
                Average |   .4427971       -.8844422    .9124154
---------------------------------------------------------------
F test that ICC=0.00: F(5.0, 18.0) = 1.79     Prob > F = 0.165

Note: ICCs estimate correlations between individual measurements
      and between average measurements made on the same target.

. * ICC(2,1 and k)
. icc rating target judge, absolute

Intraclass correlations
Two-way random-effects model
Absolute agreement

Random effects: target           Number of targets =         6
Random effects: judge            Number of raters  =         4

---------------------------------------------------------------
                 rating |        ICC       [95% Conf. Interval]
------------------------+--------------------------------------
             Individual |   .2897638        .0187865    .7610844
                Average |   .6200505        .0711368     .927232
---------------------------------------------------------------
F test that ICC=0.00: F(5.0, 15.0) = 11.03    Prob > F = 0.000

Note: ICCs estimate correlations between individual measurements
      and between average measurements made on the same target.

. * ICC(3,1 and k)
. icc rating target judge, consistency

Intraclass correlations
Two-way random-effects model
Consistency of agreement

Random effects: target           Number of targets =         6
Random effects: judge            Number of raters  =         4

---------------------------------------------------------------
                 rating |        ICC       [95% Conf. Interval]
------------------------+--------------------------------------
             Individual |   .7148407        .3424648    .9458583
                Average |   .9093155        .6756747    .9858917
---------------------------------------------------------------
F test that ICC=0.00: F(5.0, 15.0) = 11.03    Prob > F = 0.000
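One more variant worth noting, as a hedged aside (this assumes icc also accepts a mixed option, which treats the judges as fixed, the strict Case 3 setup): the point estimates should match the consistency results just above, since the ICC(3,·) point estimators are the same either way.

. * Two-way mixed-effects model: judges fixed, i.e., the textbook ICC(3,.)
. icc rating target judge, mixed consistency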