Measurement, 8: 11–15, 2010 Copyright © Taylor & Francis Group, LLC ISSN: 1536-6367 print / 1536-6359 online DOI: 10.1080/15366361003684653

Who Needs Linear Equating Under the NEAT Design?


COMMENTARIES

Gunter Maris CITO—University of Amsterdam

Verena D. Schmittmann and Denny Borsboom University of Amsterdam

Test equating under the NEAT design is, at best, a necessary evil. At bottom, the procedure aims to reach a conclusion about what a tested person would have done had he or she been administered a set of items that were in fact never administered. Such a conclusion cannot be inferred from the data, because one simply has not made the required observations. One therefore has to base the inference on strong theoretical claims (e.g., that the distinct sets of items measure the same attribute in the same way) that are not testable in the NEAT design. As a consequence, the inferences made about people's abilities, and the decisions that are based on these inferences, rest on untested theoretical assumptions—a situation that is clearly undesirable. In principle, one should therefore advise against using a NEAT design whenever it can be avoided.

Should the NEAT design nevertheless be necessitated by practical or pragmatic concerns, then the use of linear equating methods as proposed and discussed in the present set of papers will, in our view, unnecessarily compound the problems inherent to the procedure. As we will argue, all equating procedures depend on assumptions regarding the appropriateness and invariance of a measurement model that, if correct, imply that linear equating procedures are unlikely to work well. In addition, in the case where an invariant latent variable underlies the responses in both groups, that model offers an easier and more justifiable means of equating.

WHY LINEAR EQUATING IS A BAD IDEA

Kane and his colleagues start head on with an introduction to linear equating methods and their underlying assumptions (Kane et al., 2009). It would have been wise, however, to devote some attention to the position of linear equating in the larger landscape of equating methods (e.g., van der Linden, 2002; von Davier & Kong, 2005). Without any attempt at being exhaustive, we point to some relevant material in the psychometric literature.
Correspondence should be addressed to Denny Borsboom, Department of Psychology, University of Amsterdam, Roetersstraat 15, 1018 WB Amsterdam, The Netherlands. E-mail: [email protected]

First, the four criteria for successful equating introduced by Lord (1980) may help in deciding whether or not linear equating is ever a good idea:


1. Validity. The tests to be equated measure the same attribute.
2. Equity. It should not matter to a candidate whether he took one test or the other.
3. Invariance. The same measurement model should hold in all populations.
4. Symmetry. The equating function that maps test score X on test A to test score Y on test B must map score Y on test B to score X on test A.
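The symmetry criterion can be made concrete with a small sketch (not from the paper; the slope and intercept values are invented): a linear equating function from form X to form Y satisfies symmetry only if the reverse equating is its exact inverse.

```python
# Sketch of Lord's symmetry criterion for a linear equating function.
# The slope/intercept values are invented for illustration.
def equate_x_to_y(x, slope=1.2, intercept=-3.0):
    """Hypothetical linear equating function from form X to form Y."""
    return slope * x + intercept

def equate_y_to_x(y, slope=1.2, intercept=-3.0):
    """The reverse equating: the exact inverse of the X-to-Y function."""
    return (y - intercept) / slope

score_x = 25.0
score_y = equate_x_to_y(score_x)
print(score_y)                 # 27.0
print(equate_y_to_x(score_y))  # 25.0 -- symmetry holds
```

Note that pairing two separately fitted regressions (Y on X and X on Y) would violate symmetry, because the two regression lines are not inverses of one another; only a single invertible transformation can satisfy the criterion.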

Linear equating methods fare badly when one examines these criteria. For instance, consider the requirement that all of the items measure the same attribute. Clearly this is important, because if it does not hold, then equating the two forms makes about as much sense as equating shoe size with IQ. Now, if the test items measure the same attribute, that means an item response theory (IRT; Lord, 1980) model should hold. But for almost all IRT models, the test characteristic curve (which describes the dependence of the total test score on the measured attribute) is nonlinear, and so is the (implicit) relation between (expected) scores on different test forms as a function of ability. Linear equating procedures, however, are linear. Therefore they will rarely be correct.

Second, the methods involved assume that observed score regressions are invariant over groups. However, if there are group differences in the distribution of the latent variable that is measured by the items, then, in general, one should not expect the observed score regressions to be equal if the measurement model is invariant over groups (Millsap, 1997, 2007; Borsboom, Romeijn, & Wicherts, 2008). For instance, even assuming linear relations as in the factor model and equal latent variances across groups, there is a systematic relation between the between-group differences in latent means and the intercept of the observed score regression: for the group with the higher average on the latent variable, the intercept of the predictive regression will be larger. In this case, predictive regressions can only be equal across groups if there is bias in the measurement model.

Third, with respect to the equity criterion, it is useful to note that linear equating is a special case of equipercentile equating. As Lord (1980) shows, however, equipercentile equating methods can meet the equity criterion only if the two tests are parallel, in which case no equating would be needed at all.
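The nonlinearity point made above can be illustrated numerically. In this sketch (not from the papers under discussion; the 2PL item parameters for the two hypothetical forms are randomly generated), the implied relation between expected scores on the two forms is far from linear, as its local slope varies substantially across the ability range.

```python
# Sketch: expected total scores on two hypothetical 2PL test forms as a
# function of ability. Item parameters are invented; the point is that the
# implied form-to-form relation has a slope that varies with ability.
import numpy as np

def expected_score(theta, a, b):
    """Test characteristic curve: sum of 2PL item response probabilities."""
    theta = np.asarray(theta, dtype=float)
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
    return p.sum(axis=1)

# Two invented 10-item forms with different difficulty distributions.
rng = np.random.default_rng(0)
a_x, b_x = rng.uniform(0.8, 2.0, 10), rng.normal(-0.5, 1.0, 10)
a_y, b_y = rng.uniform(0.8, 2.0, 10), rng.normal(0.5, 1.0, 10)

theta = np.linspace(-3, 3, 61)
tx = expected_score(theta, a_x, b_x)
ty = expected_score(theta, a_y, b_y)

# If the relation between the forms' expected scores were linear, this
# slope would be constant in theta; here it varies markedly.
slope = np.gradient(ty, theta) / np.gradient(tx, theta)
print(round(slope.min(), 2), round(slope.max(), 2))
```

Any single linear function can match such a curved relation at only a few ability levels, which is the sense in which linear equating "will rarely be correct."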
In realistic situations, therefore, it will matter for the tested person which test was taken, so the equity criterion will be violated.

In conclusion, linear equating methods are extremely problematic in principle. That the concerns mentioned above are not merely theoretical, but have significant impact on practical applications, is clearly supported by the results of the study by Suh et al. (2009), who show some of the expected consequences of these problems in a real-data application. The procedures are inherently flawed and conflict with reasonable criteria. Therefore, we doubt whether there should be a future for these methodologies, especially in view of the fact that superior approaches are available to attack the problem (see the next section).

LATENT VARIABLE MODELS

Latent variable models allow us to address (and even formulate) some of the criteria for successful equating and offer guidance in developing equating methods. For instance, van der Linden (2000, Proposition 1) shows that observed-score equating methods can (in principle) exist that
meet all four criteria of Lord (1980), plus some additional desirable ones, under a very broad range of latent variable models.

In a unidimensional IRT model it is assumed that the performance of an examinee on a given test item depends on a latent variable, the underlying ability (Hambleton, Swaminathan, & Rogers, 1991). The relationship between item performance and the examinee's ability is described by a monotonically increasing function, the item characteristic curve: the higher an examinee's underlying ability, the higher the probability of answering an item correctly. Items may differ, for instance, in their difficulty parameter. Examples of IRT models are the Rasch model and the two-parameter logistic model.

Concurrent estimation permits the researcher to put the item and ability parameter estimates on a common scale without the need for separate linking and scaling steps (Hambleton et al., 1991). The underlying principle is simple. Suppose we collected the responses of groups A and B, with group sizes NA and NB, on Nc common anchor items, on Na additional items in the test administered to group A, and on Nb additional items in the test administered to group B. Now, the item responses of the participants in the two groups are entered into a large data matrix of dimensions (NA + NB) × (Nc + Na + Nb). The blocks of item responses to the nonadministered tests are entered as missing values. Given equivalent groups, the data may be considered missing completely at random; that is, the event that an item response is missing does not depend on observable or unobservable quantities of interest. Given group differences in the mean of the latent variable, the data may be considered missing at random; that is, the event that caused the data to be missing does not depend on the missing data itself (Little & Rubin, 1989).
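The construction of this combined data matrix can be sketched as follows (the group sizes and item counts are invented toy values; responses are random placeholders):

```python
# Sketch of the combined data matrix used in concurrent estimation:
# responses of groups A and B stacked, with the blocks for the
# non-administered items entered as missing values (NaN).
import numpy as np

N_A, N_B = 5, 4            # group sizes (toy values)
N_c, N_a, N_b = 3, 2, 2    # anchor items, extra items for A, extra for B

rng = np.random.default_rng(1)
resp_A = rng.integers(0, 2, size=(N_A, N_c + N_a))  # A sees anchor + A-items
resp_B = rng.integers(0, 2, size=(N_B, N_c + N_b))  # B sees anchor + B-items

data = np.full((N_A + N_B, N_c + N_a + N_b), np.nan)
data[:N_A, :N_c + N_a] = resp_A            # A's rows: anchor, then A-items
data[N_A:, :N_c] = resp_B[:, :N_c]         # B's anchor responses
data[N_A:, N_c + N_a:] = resp_B[:, N_c:]   # B's own items

print(data.shape)            # (9, 7)
print(np.isnan(data).sum())  # 18 = N_A*N_b + N_B*N_a missing cells
```

The anchor columns are observed for everyone, which is what links the two groups' scales; the two NaN blocks correspond to the tests each group never took.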
Subsequently, the item parameters may be estimated consistently in a multigroup design by either marginal maximum likelihood (MML) or conditional maximum likelihood (Bock & Aitkin, 1981; Fischer & Molenaar, 1995, Chapter 3). In a second step, the ability parameters may be estimated, typically treating the estimated item parameters as known constants (Fischer & Molenaar, 1995, Chapter 4).

This approach has several advantages. Mroch et al. (2009) point out that their assumption that either the anchor or the test items are measured without error will not be met in most cases; this assumption is not required under the latent variable approach. The existence of group differences is accounted for, and may even be tested. Mroch et al. mention possible issues in long chains of equating; chains of equating are not required in the latent variable approach, as this method may deal with more than two tests simultaneously. Mroch et al. also mention the possibility of examinees having had prior access to previously used items. The latent variable approach in principle allows one to identify latent groups with different structural relations between anchor items and remaining test items and, therefore, to detect the examinees one would want to submit to a retest. Clearly, the latent variable approach has distinct advantages.

Lest one think, however, that latent variable models offer a panacea for all our problems, we will also critically review this approach. Assume that some latent variable model holds, and that all parameters are known. Equating would then boil down to estimating the ability of every person from his or her responses, or so it seems. Criteria 1, 3, and 4 of Lord (1980) pose few difficulties, as their tenability can be evaluated statistically. The equity principle, however, poses a serious challenge, even for latent variable models. To see this we need to consider the estimation of ability in greater detail.
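The second estimation step, ability estimation with the item parameters treated as known, can be sketched for the Rasch model (the item difficulties and the response pattern below are invented for illustration):

```python
# Sketch of the second estimation step: maximum likelihood estimation of
# ability for one examinee under a Rasch model, treating the item
# difficulties as known constants. All parameter values are invented.
import numpy as np

def ml_ability(responses, difficulties, n_iter=50):
    """Newton-Raphson ML estimate of theta for a Rasch response vector."""
    x = np.asarray(responses, dtype=float)
    b = np.asarray(difficulties, dtype=float)
    theta = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))  # success probabilities
        grad = np.sum(x - p)                    # score function
        info = np.sum(p * (1.0 - p))            # Fisher information
        theta += grad / info
    return theta

b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])       # invented difficulties
theta_hat = ml_ability([1, 1, 1, 0, 0], b)
p = 1.0 / (1.0 + np.exp(-(theta_hat - b)))
print(round(theta_hat, 3), round(p.sum(), 3))   # expected score equals observed score
```

For all-correct or all-incorrect response patterns the ML estimate does not exist (it diverges), which is one reason weighted likelihood (Warm, 1989) or Bayesian estimators are used in practice; the bias and standard error issues discussed next apply to all of these estimators.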
There are various ways in which ability could be estimated, but all of them take the form of a known function mapping a response vector into a real number, say \hat{\theta}_X(x) and \hat{\theta}_Y(y) for test forms X and Y, respectively. These estimated abilities would be considered to be the equated scores of
the persons. If we now reconsider the equity principle of Lord (1980), which would state in this context that

\hat{\theta}_X(X) \mid \theta \stackrel{st}{=} \hat{\theta}_Y(Y) \mid \theta,    (1)

where \stackrel{st}{=} denotes equality in distribution,

we readily see that the principle certainly is not automatically met. Estimated ability, even under a fitting IRT model, is in general biased (Warm, 1989) and has a standard error that depends on the true ability in relation to the characteristics of the individual items in the test. In general, neither the bias nor the standard error will be the same for two different test forms. As a consequence, for an individual test taker it might actually matter which of the two test forms he or she gets. The situation is similar to the one with equipercentile equating signaled by Lord (1980), for which parallel tests are needed to achieve equity. Basically, we need to be able to match every item from test form X to one from test form Y with the exact same item characteristics, in which case equating the scores would no longer be necessary.

Contrary, however, to the situation where no latent variable model is assumed, we can actually make the equity assumption true via careful test construction (van der Linden, 2000). It is true that this requires that tests be constructed from an item pool for which the operating characteristics of all the items are known, but at least we can in principle achieve equity. Observe that this solution implies that we have to step out of the world of NEAT designs and into the world of item pools. This has the additional advantage that we can now test the assumption of measurement invariance of the latent variable models in the population in which (under a NEAT design) they would never be administered.

CONCLUSION

If a measurement invariant latent variable model does not underlie the data, then the entire equating procedure comes down to skating on very thin ice; for it is, in that case, not clear what one is actually doing (certainly not equating the responses of the different groups). In this case, therefore, the linear equating methods proposed would seem to be useless.
However, if a measurement invariant latent variable model does underlie the data, then (a) the assumptions of the linear methods are unlikely to be met, and (b) there are easier and more appropriate ways of handling the situation, namely through an IRT analysis. The only use we can imagine for linear equating methods of the kind investigated by Kane et al. in the present papers is as first-order approximations to the proper analysis through IRT. To investigate the appropriateness of these approximations, however, one would need to do the IRT analysis anyway; and if one has done that analysis, the linear equating methods are no longer necessary.

Our critical discussion of the latent variable approach to test equating shows, however, that if we want to adhere to the criteria for successful equating of Lord (1980), the same conclusion must be reached for equating on the basis of a latent variable model within the NEAT design, even though the latent variable approach has the benefit that it allows us to evaluate these criteria explicitly. Hence, the conclusion must be that equating methods within the NEAT design are quite problematic in general: if they work, they are not needed, and if they are needed, they do not work.


REFERENCES

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–460.
Borsboom, D., Romeijn, J. W., & Wicherts, J. M. (2008). Measurement invariance versus selection invariance: Is fair selection possible? Psychological Methods, 13, 75–98.
Fischer, G. H., & Molenaar, I. W. (Eds.). (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Kane, M. T., Mroch, A. A., Suh, Y., & Ripkey, D. R. (2009). Linear equating for the NEAT design: Parameter substitution models and chained linear relationship models. Measurement, 7, 125–146.
Little, R. J. A., & Rubin, D. B. (1989). The analysis of social science data with missing values. Sociological Methods and Research, 18, 292–326.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Millsap, R. E. (1997). Invariance in measurement and prediction: Their relationship in the single-factor case. Psychological Methods, 2, 248–260.
Millsap, R. E. (2007). Invariance in measurement and prediction revisited. Psychometrika, 72, 461–473.
Mroch, A. A., Suh, Y., Kane, M. T., & Ripkey, D. R. (2009). An evaluation of five linear equating methods for the NEAT design. Measurement, 7, 174–193.
Suh, Y., Mroch, A. A., Kane, M. T., & Ripkey, D. R. (2009). An empirical comparison of five linear equating methods for the NEAT design. Measurement, 7, 147–173.
van der Linden, W. J. (2000). A test-theoretic approach to observed-score equating. Psychometrika, 65, 437–456.
von Davier, A. A., & Kong, N. (2005). A unified approach to linear equating for the non-equivalent group design. Journal of Educational and Behavioral Statistics, 30, 313–342.
Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427–450.
