STATISTICS AND RESEARCH DESIGN
Sample calculations for comparing proportions
Nikolaos Pandis, Associate Editor of Statistics and Research Design
Bern, Switzerland, and Corfu, Greece
In the previous article, we introduced the concepts of power and type I and type II errors and gave an example of the required steps for sample-size calculations for comparing 2 means. In this article, we will perform a sample calculation for comparison of 2 proportions. Let us briefly remind ourselves of the information we need before we proceed with the example:
The research question.
The principal outcome measure of the trial.
p1, the anticipated proportion on the standard or control treatment.
p2, the anticipated proportion on the alternative treatment and hence the minimum clinically important difference (p2 − p1) between treatment arms that we would like to detect.
The degree of certainty that we want to detect the treatment difference (power) and the level of significance (type I error).
The required steps are the same as for sample-size calculations for comparing 2 means except that, when we use proportions, we do not need a standard deviation, and we use a different formula.
We are interested in conducting a trial in which we will compare overall lingual retainer failure over a 24-month period between retainers bonded with conventional acid etching vs self-etching primers. The sequence of constructing this study will be as follows.
1. We must decide what difference is clinically important and therefore worth detecting. Selecting a difference that is too small and not clinically important will increase the required sample size to impractical levels. Let us assume that we consider an absolute difference of 15 percentage points in the proportion of failures to be clinically important and that if self-etching primers, which are less moisture sensitive, can achieve this reduction in failures over 24 months, they might be worth a second look.
Am J Orthod Dentofacial Orthop 2012;141:666-7 0889-5406/$36.00 Copyright © 2012 by the American Association of Orthodontists. doi:10.1016/j.ajodo.2012.02.001
2. We must make assumptions regarding the expected proportion of failures in the control arm, which will be the conventional acid etching group in this example. Two sources that can help us determine the expected proportion of failures in the control arm are previously published studies and a pilot of the trial that we are designing. Lie Sam Foek et al1 found that the proportion of failures was around 38% for the entire follow-up period with conventional etching. If we want to detect an absolute reduction of 15 percentage points in failures in the self-etching group, we will assume a failure proportion of 23% in that group over the entire follow-up period.
3. We decide on an alpha level of 0.05, or 5%, and a power of 0.90, or 90%.
So far, we have the proportion of failures in the control arm (38%), the minimum difference to be detected (15 percentage points less), and the desired significance and power levels. To carry out this calculation, we will use the formula described by Pocock,2 which assumes independently distributed outcomes, equal numbers of participants per arm, no losses to follow-up, and no continuity correction.
\[ n = f(\alpha, \beta) \times \frac{p_1 (1 - p_1) + p_2 (1 - p_2)}{(p_1 - p_2)^2} \]
where n is the required sample size per trial arm, p1 is the anticipated proportion of failures in those on standard treatment, p2 is the anticipated proportion of failures in those on the new treatment, and f(α, β) is a function of α (the level of significance) and β (the type II error, so that power is 1 − β). The Table provides the values of f(α, β) for different α and β values. We substitute the selected values in the formula.
\[ n = 10.5 \times \frac{0.38 (1 - 0.38) + 0.23 (1 - 0.23)}{(0.38 - 0.23)^2} \approx 193 \]
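As a rough numerical check of the substitution above, the formula can be evaluated in a few lines of Python. This is a minimal sketch and not part of the original calculation: the names `F_TABLE` and `sample_size_per_arm` are illustrative, the f(α, β) values are copied from the Table, and rounding to the nearest whole participant is an assumption chosen to match the figures reported in the text (rounding up is a common, more conservative alternative).

```python
# f(alpha, beta) values copied from the Table (adapted from Pocock),
# keyed by (alpha, beta). Power is 1 - beta.
F_TABLE = {
    (0.05, 0.05): 13.0, (0.05, 0.1): 10.5, (0.05, 0.2): 7.85, (0.05, 0.5): 3.84,
    (0.01, 0.05): 17.8, (0.01, 0.1): 14.9, (0.01, 0.2): 11.7, (0.01, 0.5): 6.63,
}


def sample_size_per_arm(p1, p2, f):
    """Unrounded sample size per arm for comparing 2 proportions.

    p1, p2: anticipated proportions in the control and new-treatment arms.
    f: the value of f(alpha, beta) for the chosen significance and power.
    Assumes equal arms, no losses to follow-up, no continuity correction.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    return f * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2


# 38% failures with conventional etching vs 23% with self-etching primers,
# alpha = 0.05 and 90% power (beta = 0.1).
n_90 = sample_size_per_arm(0.38, 0.23, F_TABLE[(0.05, 0.1)])
print(round(n_90))      # 193 per arm
print(2 * round(n_90))  # 386 in total
```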
This indicates that the required sample size to detect a 15% difference in the proportion of failures between the conventional acid etching and the self-etching group over a 24-month period with 90% power and at the 5% level of significance would be 193 per treatment arm, for a total of 386 patients. If we reduce the power to 80%, the required sample size per treatment arm would be the following.
Table. Values of f(α, β) for different combinations of power and level of significance, adapted from Pocock2

α       β = 0.05       β = 0.1        β = 0.2        β = 0.5
        (95% power)    (90% power)    (80% power)    (50% power)
0.05    13.0           10.5           7.85           3.84
0.01    17.8           14.9           11.7           6.63

\[ n = 7.85 \times \frac{0.38 (1 - 0.38) + 0.23 (1 - 0.23)}{(0.38 - 0.23)^2} \approx 144 \]
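The f(α, β) value used in this calculation need not be read from the Table. Assuming a 2-sided test and the usual normal approximation, f(α, β) is the square of the sum of the standard normal quantiles at 1 − α/2 and 1 − β. That definition is not stated in the text, but it agrees with the Table to the precision shown. The sketch below (the function name `f_alpha_beta` is illustrative) recomputes the Table and repeats the 80% power calculation.

```python
from statistics import NormalDist  # standard library, Python 3.8+


def f_alpha_beta(alpha, beta):
    """Assumed definition: (z_{1 - alpha/2} + z_{1 - beta}) ** 2, 2-sided test."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    return (z(1 - alpha / 2) + z(1 - beta)) ** 2


# Recompute each Table row, shown to 3 significant figures.
for alpha in (0.05, 0.01):
    row = [f_alpha_beta(alpha, beta) for beta in (0.05, 0.1, 0.2, 0.5)]
    print(alpha, [f"{value:.3g}" for value in row])

# Repeat the calculation for alpha = 0.05 and 80% power (beta = 0.2).
p1, p2 = 0.38, 0.23
n_80 = f_alpha_beta(0.05, 0.2) * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(round(n_80))  # 144 per arm
```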
This indicates that the required sample size to detect a 15% difference in the proportion of failures between the conventional acid etching and the self-etching groups over a 24-month period with 80% power and at the 5% level of significance would be 144 per treatment arm, for a total of 288 patients.
By playing with different scenarios, we can see that the required number of patients increases as the proportions of events in the 2 arms move closer to 0.5, as the difference to be detected decreases, as the level of significance decreases, and as the power increases. Again, we must be sensible in our assumptions. We can decrease the required sample size to 36 per treatment arm, or a total of 72, if we set the difference in the proportion of failures to be detected at 30 percentage points instead of 15, keeping the power at 90% and the level of significance at 5%. However, is the 30% difference a sensible clinical assumption? Setting large differences to be detected artificially decreases the required sample size and is not appropriate if the size of the difference to be detected is not realistic in a clinical setting.
In the above example, sample calculations were based on the proportion of failures per treatment group. Another approach would have been to use rates, which are a more appropriate measure when survival time is calculated. Proportions indicate the total number of failures over the entire follow-up period and do not incorporate a time element, whereas rates, which are often used erroneously in orthodontics instead of proportions of failure, include a time element.
Sample calculations are often conducted for the main comparison and not for subgroups. Comparisons between subgroups (eg, between only males or only females for the lingual retainer adhesive groups) are not powered to detect the differences assumed for the main comparisons. Ideally, sample calculations should account for all intended comparisons at the design stage.
KEY POINTS
Sample calculations should be based on clinically meaningful differences, consider previous knowledge, and balance statistical precision, trial feasibility, and credibility.
Power is considered at the design stages, and it has no value after the trial is conducted.
Sample calculations for proportions are similar to sample calculations for means but require a different formula and no standard deviation.
REFERENCES
1. Lie Sam Foek DJ, Ozcan M, Verkerke GJ, Sandham A, Dijkstra PU. Survival of flexible, braided, bonded stainless steel lingual retainers: a historic cohort study. Eur J Orthod 2008;30:199-204.
2. Pocock SJ. Clinical trials: a practical approach. Chichester, United Kingdom: Wiley; 1983. p. 125-9.
American Journal of Orthodontics and Dentofacial Orthopedics
May 2012 Vol 141 Issue 5