Theory & Psychology http://tap.sagepub.com/

The cat came back: Evaluating arguments against psychological measurement
Keith A. Markus and Denny Borsboom
Theory & Psychology, published online 14 October 2011
DOI: 10.1177/0959354310381155

The online version of this article can be found at: http://tap.sagepub.com/content/early/2011/10/11/0959354310381155 Published by: http://www.sagepublications.com


Proof - Oct 14, 2011

Downloaded from tap.sagepub.com at UVA Universiteitsbibliotheek on November 23, 2011

Article

The cat came back: Evaluating arguments against psychological measurement

Theory & Psychology 1–15 © The Author(s) 2011 Reprints and permission: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/0959354310381155 tap.sagepub.com

Keith A. Markus

John Jay College of The City University of New York

Denny Borsboom University of Amsterdam

Abstract

The possibility or impossibility of quantitative measurement in psychology has important ramifications for the nature of psychology as a discipline. Trendler’s (2009) argument for the impossibility of psychological measurement suggests a general and potentially fruitful strategy for further research on this question. However, the specific argument offered by Trendler appears flawed in several respects. It seems to conflate what must hold true with what one must know, and also to equivocate on the necessary evidence. Moreover, if the argument supported its conclusion, it would rule out qualitative discourse on psychology as well as psychological measurement. Taking Trendler’s argument as an example, one can formulate a general structure for arguments adopting the same basic strategy. An overview of the requirements that such arguments should meet provides a metatheoretical perspective that can assist authors in constructing such arguments and readers in critically evaluating them.

Keywords measurement, philosophy of science, psychometrics, quantity, validity

Corresponding author: Keith A. Markus, Psychology Department, John Jay College of Criminal Justice, The City University of New York, 445 W 59th Street, New York, NY 10019, USA. Email: [email protected]

Trendler (2009) presents an argument to the effect that psychological measurement represents a revolution that cannot happen. This contributes to a continuing discussion of the nature and plausibility of quantitative measurement in psychology. For instance, Cliff (1992) lamented the fact that psychologists have not made use of developments in measurement theory (e.g., Krantz, Luce, Suppes, & Tversky, 1971), instead relying on weakly
supported measurement practices; Borsboom (2006) criticized the fact that psychologists do not attend well to developments in psychometrics, and fail to take advantage of the toolbox of techniques developed in that area; and Michell (1990, 1997, 1999) has criticized the presence of untested assumptions in psychometric applications; specifically, the fact that the existence of a quantitatively structured, measurable attribute is commonly assumed but rarely tested, and hence is unsupported by evidence. In earlier decades, one can find similar criticisms of the idea that observational procedures in psychology qualify as measurement (e.g., the Ferguson Committee; see Michell, 1999; Reese, 1943; Stevens, 1946), as well as doubts with respect to the assumption that psychological attributes are measurable at all (e.g., the quantity objection; Bergson, 1913; Michell, 1999). Thus, notwithstanding the fact that common practices in psychology take for granted the idea that psychological attributes are measurable, and in fact are measured by commonly used tests, skepticism about the validity of this assumption has been an undercurrent during the entire history of psychology. Trendler (2009) takes the skeptical attitude a step further, by arguing that measurement in psychology is impossible in principle. As far as we know, this is the most ambitious conclusion hitherto offered in the debate. Trendler’s work addresses an important topic, and is part of a useful research program in clarifying the semantics and practice of measurement and testing in psychology. 
If correct, the paper would be an extremely important addition to the literature, because it would effectively defuse research programs that are predicated on the assumption that psychological measurement is possible; we need not argue that these are many and varied (e.g., large parts of psychometrics, intelligence research, and personality testing, but also of psychonomic research on memory, learning, cognition, etc., would be discredited). However, as we will show, Trendler’s argument, at least when taken at face value, contains flaws; moreover, it is not clear that these flaws can be readily corrected. The present article therefore reflects on some specific weaknesses of his argument and draws some general conclusions about the development and critical evaluation of such arguments in the literature. One can schematize Trendler’s main argument as follows.

1. Measurement requires that for any two magnitudes, the two either equal one another or differ from one another.
2. For any given property to be measured, someone must empirically establish that any two levels of the property are either equal or different.
3. When the property in question does not allow for direct comparison, one must rely on indirect comparison.
4. Indirect comparison requires showing that equal magnitudes correspond to equal measures within random error.
5. Showing (4) requires, in turn, that one can establish equal magnitudes independently of equal measures, and then use them to compare the measures.
6. In psychology, magnitudes do not permit precise manipulation, and one cannot control systematic biases sufficiently well to complete this task.
7. Therefore, psychological magnitudes cannot be measured.

Premise 1 states that measurement requires an assumption that we dub Half Hölder’s Axiom (HHA) for ease of reference: For any two magnitudes, either they equal one another or they differ. To illustrate this, Trendler gives the example of task motivation, where two people either have the same level of motivation or differ in motivation level. Premise 2 introduces the central claim that one must establish HHA empirically. The next section will address this premise in detail. One cannot directly observe task motivation (Premise 3), so one needs to manipulate it to a known value and compare the measurement to see that it takes the appropriate value plus or minus measurement error (Premises 4 and 5). However, it is not possible to manipulate motivation to a specified value (Premise 6). Consequently, we cannot measure motivation (Conclusion 7). The conclusion is subsequently generalized to the whole of psychology: that is, Trendler (2009) concludes that psychological measurement is impossible across the board. We applaud Trendler’s effort to develop the above argument and shed light on the prospects for measurement in psychology. However, given the far-reaching consequences of this conclusion for scientific psychology, the argument requires detailed scrutiny. In this respect, our reading of Trendler’s argument leaves us with several open questions described in the remaining sections. The first of these considers the importance of distinguishing what one can know or establish with reasonable warrant for belief from what holds true independent of what one can know or establish. The second explores further the prospects for empirical tests of HHA (Premise 2). The third section considers the plausibility of Premise 5, and the last section prior to the discussion considers the broader implications of Trendler’s conclusion.
Taken together, these considerations suggest that measurement in psychology demonstrates a perplexing resilience to attempts to part ways with it, reminiscent of Mr. Johnson’s cat in Harry S. Miller’s song.

Measurability versus known measurability

In tracing the line of reasoning in the above schematization of Trendler’s argument, the most immediately apparent gap between the premises and conclusion involves the drift from talking about what is testable to talking about what is true. If it remains possible to measure something without being able to demonstrate or provide sound evidence in support of the belief that one has the ability to measure it, then this gap between doing and knowing opens a chasm that blocks inference from the premises to the conclusion. For example, suppose that a ruler measures length on a ratio scale. Next, suppose that a ruler were dropped from the sky on an alien culture that had not yet invented such things. The new owners of the ruler may have no means to demonstrate that the ruler measures length, or to justify belief that it does, but may nonetheless still be able to measure length. At first blush, the same possibility would seem to hold for psychological measures. One means of attempting to bridge this chasm comes from the tradition of axiomatic measurement theory. Michell (1997) carved up the task of creating measurement instruments into the scientific task of establishing the quantitative structure of a given attribute (say, length), and the instrumental task of crafting a system that is able to assess this structure (say, a tape measure), and condemned psychology for claiming success in the instrumental task while neglecting the scientific one. Trendler (2009) accepts this adage
(“Obviously the scientific task must always precede the instrumental task,” p. 582) and builds much of his argument around it. Although we have no quarrel with the logical order here (it is hard to defend the interpretation of an instrument’s measurement outcomes as indicative of a quantitative attribute if one cannot justify the supposition that the attribute in fact is quantitative), the interpretation of logical order as giving a norm for temporal order appears to us fallacious in this case, and cannot serve to bridge the chasm in Trendler’s argument. Hölder’s axioms have only been around for a century, and the axiomatic approach to representational measurement theory is even younger. The Egyptians did not have access to set theory, homomorphic mappings, the axioms of quantity, or any of the other mathematical structures that appear in the criticisms by Michell (1997, 1999) and Trendler (2009). Yet they did build the Pyramids, and they had measuring ropes and plumb bobs (Root, 1989). Now, the Egyptians likely had an intuitive understanding of the composition of quantitative structure, as any ordinary human being does, but that understanding must have been quite implicit; we strongly doubt that any Egyptian ever sat down to lay out the empirical relational structure of a thousand different rods, or to prove representation and uniqueness theorems for mapping that structure into a numerical relational structure. Clearly, the Egyptians could construct structures that modern measurement theory would identify as standard sequences (e.g., based on a standard distance between knots in a measuring rope), and they knew very well how to use them. But the construction and appropriate use of such measurement instruments was apparently possible in the absence of any success in the “scientific task” (or even the awareness that there was one).
What is important, then, is not the temporal precedence of actually solving the scientific task before completing the instrumental one, or even that the scientific task be completed somewhere in the future. Had the world collapsed before Hölder could state his axioms, the Egyptians surely would nevertheless have been involved in quantitative measurement. What matters is not that we know how to show that a given attribute is quantitative, but that, in point of fact, the attribute in question is quantitative. We are currently in a state of ignorance with respect to this issue insofar as it concerns many cases of putative measurement in psychology. It may very well be that our ignorance will be enduring, and that we will never find out in which cases we do and in which cases we do not have quantitative measures. However, this does not imply that we do not or will not have these measures. It merely means that we do not know how to show that we have them. Thus, unless one accepts a reduction of truth to verification (a notoriously problematic and marginal position that, in our view, is indefensible), Trendler’s argument at best establishes the impossibility of epistemic access to the truth of the measurement claim, not the impossibility of that claim holding true. Another possible approach to bridging this chasm comes from the tradition of psychometrics. Much validity literature emphasizes validity as justification for interpreting scores in a particular way. If one takes measurement as a social practice that requires acceptance as part of group norms in order to function, then it becomes more plausible that measurement rests on justification. Suppose that the alien culture imagined above has a market where merchants sell some commodity by length. If nobody involved believed that a ruler could be used to measure length, then it would not be possible to use the ruler to measure length as part of the social practice of selling the commodity
in question, not even if the ruler can measure length. However, repairing Trendler’s argument along these lines would require a detailed analysis of why this view holds in all cases to which the argument applies. Even on this approach, the step across the chasm remains far from trivial.

What counts as disconfirming evidence?

As identified above, Trendler’s argument rests on the alleged inability to verify or falsify HHA. It is not clear, however, what the truth conditions of HHA are, whether these are logical or empirical in nature, and (if they have empirical content, as Trendler appears to assume) what their falsification instances are. Citing Michell (1999), Trendler (2009) phrases HHA as “any two magnitudes of the same quantity are either identical or different” (p. 582). This follows from an axiom given by Hölder (1901; see also Michell & Ernst, 1996, 1997), the first half of which states that for any two magnitudes of a common quantity, a and b, either a = b or a ≠ b. The second half of the axiom states: “and if the latter, then one is always greater than the other” (hence our reference to HHA as Half Hölder’s Axiom). Recall that Premise 2 requires empirical support for HHA, and the full argument turns crucially on this premise. This is because the conclusion against the possibility of quantitative measurement rests decisively on the practical impossibility (but not the logical impossibility) of collecting such evidence. Assuming the standard (weak, inclusive) interpretation of “or,” however, HHA remains consistent with three of the four possibilities: (1) the magnitudes equal one another and do not differ, (2) the magnitudes do not equal one another and differ, or (3) the magnitudes both equal one another and differ from one another. Only case (4), in which the magnitudes neither equal one another nor differ from one another, offers possible disconfirming evidence for HHA. What might count as disconfirming evidence (for brevity, ~HHA) depends upon how one interprets HHA. The remainder of this section considers two broad classes of interpretations that differ with respect to whether one takes HHA as a tautology. We argue that neither of these interpretations buys Trendler the desired conclusion.
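
The case analysis above can be checked mechanically. The Python sketch below treats “a equals b” and “a differs from b” as two independent Boolean propositions (an assumption of the sketch; on the tautological reading discussed next, “differs” is simply the negation of “equals,” and the fourth row could not arise) and evaluates HHA with the inclusive “or.” The function name `hha` is illustrative only.

```python
from itertools import product


def hha(equal: bool, differ: bool) -> bool:
    """Half Hölder's Axiom read with the weak (inclusive) 'or'."""
    return equal or differ


# Enumerate the four combinations of truth values for
# "a equals b" and "a differs from b".
rows = [(eq, diff, hha(eq, diff)) for eq, diff in product([True, False], repeat=2)]

for eq, diff, holds in rows:
    print(f"equal={eq!s:5}  differ={diff!s:5}  HHA holds={holds}")

# Only the row in which neither proposition holds can disconfirm HHA.
violations = [(eq, diff) for eq, diff, holds in rows if not holds]
print("Disconfirming cases:", violations)
```

Three of the four rows satisfy HHA, and the single disconfirming row is the one where both propositions are false, matching case (4) in the text.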

Tautological interpretations of HHA

The statement of the axiom in terms of quantities seems unfortunate because it invites conflating the quantity with the measure of it in a way that threatens the argument. If one takes equality as numeric equality and difference as numeric inequality, then HHA appears equivalent to stating that two magnitudes of the same quantity either equal one another or do not equal one another. However, this assertion constitutes a substitution instance of P or Not P, the law of the excluded middle. This law holds true axiomatically in any logico-mathematical system that assumes bivalence (i.e., in which all well-formed formulae have one of two truth values, true or false). As a logical tautology, HHA therefore follows trivially from any premise and holds true as a matter of logical necessity. Under this interpretation, the logical possibility of empirical evidence disconfirming HHA requires that one reject bivalence.
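
The dependence on bivalence can be illustrated with a small truth-functional sketch. It uses Kleene’s strong three-valued logic purely because it is the simplest non-bivalent system to write down; it is not the intuitionist semantics mentioned below, which is not truth-functional in this simple way. The names `TV`, `k3_not`, `k3_or`, and `excluded_middle` are illustrative.

```python
from enum import Enum


class TV(Enum):
    """Truth values for Kleene's strong three-valued logic (K3)."""
    FALSE = 0
    UNKNOWN = 1  # "neither true nor false"
    TRUE = 2


def k3_not(p: TV) -> TV:
    # Negation swaps TRUE and FALSE and leaves UNKNOWN fixed.
    return TV(2 - p.value)


def k3_or(p: TV, q: TV) -> TV:
    # Disjunction takes the greater of the two truth values.
    return TV(max(p.value, q.value))


def excluded_middle(p: TV) -> TV:
    """Evaluate 'p or not p', the form HHA takes on the tautological reading."""
    return k3_or(p, k3_not(p))


# Restricted to the two classical values, 'p or not p' is always TRUE (bivalence).
bivalent = {p: excluded_middle(p) for p in (TV.TRUE, TV.FALSE)}

# Once a third value is admitted, 'p or not p' is UNKNOWN whenever p is UNKNOWN.
three_valued = {p: excluded_middle(p) for p in TV}

print(bivalent)
print(three_valued)
```

In this sketch the excluded middle stops being guaranteed true exactly when a proposition is allowed to be neither true nor false, which is the sense in which disconfirming HHA on this reading requires giving up bivalence.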


Although it renders Trendler’s argument somewhat implausible, this interpretation has some interesting aspects that make it worth briefly exploring. Perhaps the best-known instance of rejecting bivalence involves intuitionist theories of mathematics, in which a theorem does not become true until derived and does not become false until its negation is derived. This requires a third truth value for theorems that are neither true nor false (Dummett, 1993). Applied to measurement, this would suggest that two magnitudes neither equal nor fail to equal one another until their equality has been assessed, presumably empirically. This would make sense in a situation in which the measurement operation assigned magnitudes that did not apply to the entities in question until after the completion of the measurement operation. For example, assigning unique identification numbers to participants in a study might offer an instance in which the numbers stipulate a value rather than being derived from a pre-existing property. Understood this way, HHA asserts that measurement applies to some pre-existing magnitude rather than to operations that introduce values de novo. Such an interpretation of the axiom seems to comport well with typical informal assumptions about measurement. A second way to violate bivalence involves referentially defective discourse: assigning truth values to assertions that somehow fail to refer successfully to part of what they describe. In this case, one might deem the assertion that a = b neither true nor false because it somehow fails to refer to something that unambiguously renders it true or false. An example would be a vague predicate, one that admits several different precise meanings at once.
For example, if one measured tallness rather than height, tallness might be defined relative to a specified height that provides a zero point, such that individuals taller than that height have positive tallness and individuals shorter than it have negative tallness. However, the precise zero point might fall anywhere within a certain range of values, making the property of tallness vague. In that case, one possible precise meaning of tallness might render a = b, but other possible meanings would render the same a unequal to other possible b values, and one might therefore deem a = b neither true nor false. (If one deemed it both true and false, this would not disconfirm HHA.) In this case, HHA might be taken as ruling out vague predicates in measurement, which again seems to comport with informal intuitions about measurement. Another type of referential failure involves referring to things that do not exist. For example, if 1,000 people take a test, then the assertion that the score of the 1,000th person equals that of the 1,001st person fails to refer successfully, and thus one might deem it neither true nor false. However, the existence of the individuals measured seems normally assumed and not contentious in typical measurement contexts. So, the relevant sort of failure here would seem to be the case of agreed-upon individuals who fail to have the measured property. Indeed, Hölder’s fourth axiom rules out such cases by requiring that the sum of two magnitudes always exceeds each of the magnitudes (Michell & Ernst, 1996, p. 238).¹ This precludes non-positive values and is termed “positivity” by Krantz et al. (1971, p. 73), who consider it optional. In this case, counterexamples are easy to produce, but it seems very implausible that every magnitude measured in psychology requires that no one can lack the attribute entirely (no knowledge of algebra, no motivation to succeed in therapy, no prior training in a particular job task).
Examples prove elusive if one drops the positivity axiom and takes zero as none. Does the number of toes on a cube equal the number of toes on a sphere? Neither has toes,
but yes, they both seem to have zero toes. One generally assumes that mental entities lack volume. Does the volume of a memory equal the volume of a belief? If they both have zero volume, then the answer appears affirmative. If the idea of volume simply cannot apply, then perhaps they are neither equal nor not equal. Intuitions seem less clear in this case, but again, HHA seems to require something plausible. In each of these interpretations, however, disconfirming empirical evidence is hard to imagine. What data would show that a = b fails to hold either true or false before the measures are taken? What data would show that a = b fails to successfully refer in a way that would make it true or false? Premise 2 asserts that one must provide empirical support for HHA, but in these instances, it seems entirely unclear that this offers the best strategy to settle these issues. They seem much more amenable to formal analysis of the logical structure of the assertions involved. In any event, Trendler’s (2009) explanation certainly does not make it clear to the reader what sorts of concrete examples might satisfy Premise 2 under such a reading. One thing that seems to make such disconfirming evidence elusive is the fact that collecting disconfirming evidence seems to require that one render the situation sufficiently clear and precise so as to satisfy the axiom.

Non-tautological interpretations of HHA

Given that Trendler’s argument turns on empirical evidence against HHA, it seems best to interpret HHA as something other than a tautology. Tautological interpretations stem from interpreting “a differs from b” as meaning that a does not equal b. The statement of HHA in terms of magnitudes of the same quantity encourages such a reading. However, a more plausible reading results from instead interpreting HHA as referring to reported equality and difference judgments of some sort. For example, Krantz et al. (1971, p. 17) give the example of an individual reporting preferences for a series of objects and encountering a pair of objects for which he or she experiences neither a “strict preference” nor “indifference.” In such cases, one establishes equality (indifference) and difference (preference) by means independent of one another. Thus, both can fail (~HHA) without violating bivalence. Clearly separating the empirical phenomenon from the numerical representation aids clear discussion of this type of ~HHA evidence. In other words, the magnitudes in HHA refer not to the numeric quantities used to represent the quantity measured, but rather to the results of some empirical procedure for generating these numeric quantities. Thus, the empirical procedure can fail HHA even when the numeric quantities do not. The above example and other behavioral science examples found in Krantz et al.’s (1971) presentation of conjoint measurement have the distinctive characteristic of involving a judgment or other implicit scaling produced by the individual(s) under study. This situation seems to differ importantly from the typical situation in psychological measurement, in which the researcher seeks to scale individuals according to some property. The two appear independent of one another.
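
The point that equality and difference judgments are elicited independently, so that both can fail, can be made concrete with a minimal data structure. In this sketch the `PairJudgment` record, its field names, and the three item pairs are hypothetical, not data from Krantz et al. (1971). The two judgment flags are stored separately rather than derived from one another, which is exactly what makes a ~HHA record expressible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairJudgment:
    """Hypothetical record of two separately elicited judgments about items a and b."""
    a: str
    b: str
    indifferent: bool        # respondent reports indifference (empirical "equality")
    strict_preference: bool  # respondent reports a strict preference (empirical "difference")


def satisfies_hha(j: PairJudgment) -> bool:
    # Because the two reports are elicited independently, both can be absent.
    return j.indifferent or j.strict_preference


judgments = [
    PairJudgment("apple", "pear", indifferent=True, strict_preference=False),
    PairJudgment("apple", "plum", indifferent=False, strict_preference=True),
    # A pair for which the respondent reports neither: an instance of ~HHA.
    PairJudgment("pear", "plum", indifferent=False, strict_preference=False),
]

not_hha = [(j.a, j.b) for j in judgments if not satisfies_hha(j)]
print("Pairs violating HHA:", not_hha)
```

Had `strict_preference` been computed as `not indifferent`, the third record could not exist; keeping the two procedures separate is what gives HHA empirical content on this reading.
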
One could imagine a situation in which individuals fail to quantitatively scale preferences, but researchers can quantitatively scale individuals in terms of their preference for a particular item. Krantz et al. give the example of an individual who always takes one factor as decisive over another, and only uses the second factor to decide between ties on the first factor. Such an individual
violates the axioms of conjoint measurement in his or her preference ratings because no finite number gives the ratio between the effect of the first factor and that of the second on overall preference. However, a group of individuals each following such rules could nonetheless produce a differential preference structure across individuals that remains consistent with quantitative scaling practices. Presumably, however, this would not involve conjoint measurement, given that no apparent reason compels the researcher to construct the overall preference magnitudes from an additive combination of separate preference criteria. If one is interested in whether individuals represent properties quantitatively, one might find out by asking the right questions. If one is interested in measuring individual differences, independent of how individuals might represent those differences to themselves, then it seems less than clear what sort of empirical data could provide disconfirming evidence in this form. Trendler points to a solution involving manipulating the intended attribute to a fixed value, and then demonstrating that the measurement procedure produces the same value within measurement error. This procedure seems to assume HHA rather than test it. Certainly, it would provide useful disconfirming evidence to discover that manipulating a to the same value as b produces different measurement results when applying the same measurement method to a and to b. However, this procedure seems to provide a test of the assumption that a equals or exceeds b if and only if the result of measuring a equals or exceeds the result of measuring b, rather than providing a direct test of HHA. The means of empirically testing HHA remain obscure under both tautological and non-tautological interpretations, although the latter appears more promising. Certainly, it would be helpful for anyone who sees a clear path to such empirical tests to provide concrete examples.
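
The lexicographic chooser from Krantz et al.’s example can be sketched in code to show why no finite weight ratio reproduces that person’s preferences. This is a minimal illustration under stated assumptions: each option is a hypothetical pair of levels on two factors, the additive model is normalized as `ratio * x1 + x2` with a positive, finite `ratio`, and the second factor is assumed to be unbounded (with a finite set of levels, a large enough ratio can mimic the lexicographic rule, so this assumption matters). All function names are illustrative.

```python
from typing import Tuple

Profile = Tuple[float, float]  # (level on decisive factor, level on tie-breaking factor)


def lexicographic_prefers(x: Profile, y: Profile) -> bool:
    """True if a lexicographic chooser strictly prefers x to y.

    The first factor decides; the second only breaks exact ties on the first."""
    if x[0] != y[0]:
        return x[0] > y[0]
    return x[1] > y[1]


def additive_prefers(x: Profile, y: Profile, ratio: float) -> bool:
    """True if the additive utility ratio * x1 + x2 ranks x strictly above y."""
    return ratio * x[0] + x[1] > ratio * y[0] + y[1]


def counterexample(ratio: float) -> Tuple[Profile, Profile]:
    # x wins on the decisive factor by one unit; y makes up for it on the
    # second factor by more than `ratio` units, which flips the additive ranking.
    return (1.0, 0.0), (0.0, ratio + 1.0)


for ratio in (1.0, 10.0, 1e6):
    x, y = counterexample(ratio)
    assert lexicographic_prefers(x, y)        # the lexicographic chooser prefers x
    assert not additive_prefers(x, y, ratio)  # the additive model prefers y
print("No tested finite weight ratio reproduces the lexicographic order.")
```

The loop checks only three ratios, but `counterexample` works for any positive ratio, which is the sense in which the first factor’s effect relative to the second has no finite value.
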
The above considerations at best provide a rough guide to what such examples would need to show. This rough guide has not yielded any categorical conclusions, except that the exact nature of an empirical test of HHA remains unclear. Our diagnosis is that Trendler’s argument walks a fine line in attempting to argue that such empirical evidence is both necessary and impossible, given that necessity normally requires possibility. Essentially, the argument requires that empirical tests be logically possible but practically impossible. The first half of this requirement, asserted by Premise 2, needs further explanation. Perhaps the best available conclusion repeats the warning given by Krantz et al. (1971) that “One must keep in mind the fact that the refutability of axioms depends both on their mathematical form and on their empirical interpretation” (p. 29).

Does Trendler’s argument prove too much?

Slicing Hölder’s (1901) first axiom into two halves has a further problematic consequence, namely that it would seem to preclude any kind of structured observation. The full axiom clearly applies to quantitative properties, but HHA only applies to the assessment of (in)equality. However, the assessment of (in)equality in itself is not specific to quantitative structure: constructing equivalence classes through the equivalence relation (~) is fundamental to every measurement system, including the nonmetric systems involving unordered categorical (i.e., nominal) or ordered categorical (ordinal) representations of attributes. Trendler (2009) notes that, to establish the impossibility of measurement in psychology, “[i]t will be sufficient to focus solely on the first condition of
quantity in order to decide if psychological attributes are measurable, since it is a necessary (though of course not sufficient) requirement for the identification of quantitative structure” (p. 583). However, since establishing equality is a necessary condition for categorization just as much as it is for quantitative representation, the upshot of Trendler’s argument is that, if correct, it rules out the possibility not only of quantitative measures, but of nominal ones as well. This seems to us a very strong conclusion to draw. We appear to be able to make truthful judgments of equality in at least some cases of observation in psychology. Consider, for example, the property of knowing the names of all four authors of the Foundations of Measurement trilogy. There are people who do and people who do not know the names of the authors in question, and one might actually venture the hypothesis that we could find out which by asking them. It appears to us that the construction of equivalence classes in this case is not impossible at all. In fact, if obtaining equivalence classes of any type were somehow mysteriously precluded in psychology, Trendler’s problem would preclude not only measurement, but structured observation and classification as well. Perhaps, however, we should interpret the situation differently. For instance, one may put forward the alternative claim that the assessment of equality is in fact possible in psychology, but not for quantitative attributes. It is not clear how Trendler’s argument could survive such a reinterpretation: if one can assess equality, one can construct nominal representations; if one can construct nominal representations, then the first part of Hölder’s axiom is satisfied; and if the first part of Hölder’s axiom is satisfiable, then the problem with psychological measurement must lie elsewhere in Hölder’s axiomatization, namely in establishing orders and equivalence classes of differences.
Thus, Trendler’s argument cannot apply to the situation so interpreted. In addition, the idea that only quantitative attributes pose difficulties in setting up equivalence classes presumes that such attributes do in fact exist (we were under the impression that this was actually one of the main contested issues in the discussion on psychological measurement), which implies the possibility that they are in fact measurable, perhaps by means other than those known from the physical sciences, maybe even by using psychometric models. Clearly, if Trendler’s argument shows the impossibility of establishing equivalence, then this either precludes categorization in psychology across the board, or it must somehow apply exclusively to quantitative psychological attributes. The first conclusion is, we think, too strong, but the second finds no justification in Trendler’s (2009) article.
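
Constructing the equivalence classes in the author-names example requires nothing more than a classification rule and a check that it behaves as an equivalence relation. The sketch below uses five hypothetical respondents and an illustrative `same_class` relation. Because the relation is defined by sameness of answers, it is an equivalence relation by construction, so the reflexivity, symmetry, and transitivity checks make the requirements explicit rather than test anything substantive.

```python
from collections import defaultdict
from itertools import product

# Hypothetical responses: does each person know the names of all four authors?
knows_all_four = {"P1": True, "P2": False, "P3": True, "P4": False, "P5": False}


def same_class(x: str, y: str) -> bool:
    """Empirical 'equality' relation: x and y give the same answer."""
    return knows_all_four[x] == knows_all_four[y]


people = list(knows_all_four)

# Check that the relation is an equivalence relation on this finite sample.
reflexive = all(same_class(x, x) for x in people)
symmetric = all(
    same_class(x, y) == same_class(y, x) for x, y in product(people, repeat=2)
)
transitive = all(
    not (same_class(x, y) and same_class(y, z)) or same_class(x, z)
    for x, y, z in product(people, repeat=3)
)

# Partition the sample into equivalence classes: a nominal representation.
classes = defaultdict(list)
for person, answer in knows_all_four.items():
    classes[answer].append(person)

print(reflexive, symmetric, transitive)
print(dict(classes))
```

Nothing here requires order, let alone quantity, which is why an argument that blocked this construction would block classification as well as measurement.
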

Can one test scale properties by other means?

Trendler’s argument presents HHA as the key element that prevents measurement of psychological properties. However, the test Trendler describes does not test HHA but rather tests that one can manipulate the property to a knowable quantity that the measure can then detect within measurement error. For example, one could use a Celsius thermostat to heat a room and a Fahrenheit thermometer to measure the room temperature. This set-up would fail Trendler’s test because the manipulated value does not match the measured value. In order to pass the test, one would have to either match the scales of the two devices, or else convert the values from one to the scale of the other. Nonetheless, both
Celsius and Fahrenheit scales conform to HHA. If one focuses on what the proposed test in fact tests, rather than HHA, one can ask whether one could test this by other means. If so, then a gap opens in Trendler’s argument because showing that one cannot establish the scale of measurement using the test he described does not rule out the possibility of establishing it by other means, and this would bypass the hurdle presented in his argument against psychological measurement. Trendler’s test rests on the availability of only one measure and the possibility of manipulating the quantity in question precisely. To set the scale, Trendler therefore proposes manipulating the quantity to a known value and testing that the measure falls within measurement error of that value. In most cases of psychological measurement this is indeed impossible, either because we have no experimental means of controlling the properties that we attempt to measure or, as Trendler describes, we can manipulate the property but not to a precise value. However, it would seem unlikely that this provides a categorical reason why psychological measurement is impossible; many properties in the natural sciences are resistant to manipulation (the age of the universe, the structure of a string of DNA, the rise of oceanic temperature), yet it appears possible to measure them nevertheless. In cases where experimental manipulation is impossible, measurement procedures typically rest on triangulation of multiple measures that are aimed at assessing the same property. If one has multivariate information, then one can check that they share the same scale without resorting to manipulation. For example, without using a thermostat and furnace to heat a room, one can take the ambient temperature as one finds it and compare the readings of two thermometers using the same scale (e.g., both Celsius). If they match, then the odds are very good that they both measure temperature on the same scale. 
Of course two broken thermometers that always register the same value will pass this test, so one would want to repeat it in rooms of varying temperature. The point is that one never needs to manipulate the room temperature, or know the temperature to which one heats the room. One only needs to compare two independent measures of the same quantity measured on the same scale. If one can do something similar in the context of psychological measurement, for example by triangulating on a common property with multiple measures, then one can bypass the need for precise manipulation in order to achieve psychological measurement. More broadly, we think that this is precisely the method on which general psychometric modeling procedures rest. These procedures test whether the association structure in a multivariate space that arises from obtaining multiple measures of the same property is in accordance with the hypothesis that the observed variables in fact measure the same thing. The conditions required to identify such models, to fit them to empirical data, and to achieve various scale levels, have been studied extensively in the psychometric literature (e.g., Hambleton & Swaminathan, 1985; Lord & Novick, 1968; Rasch, 1960; van der Linden & Hambleton, 1997).
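The thermometer examples above can be made concrete with a small simulation. This is a sketch under our own assumptions: the set-point, error SD, tolerance and number of rooms are illustrative, and `triangulate` is a hypothetical helper, not a standard procedure. Part (1) reproduces the Celsius/Fahrenheit set-up that fails a Trendler-style comparison until values are converted, while showing that both scales preserve ratios of temperature differences. Part (2) compares two thermometers across rooms whose temperatures are never manipulated, and adds a check that readings vary so that two stuck thermometers do not pass.

```python
import random
import statistics


def c_to_f(celsius):
    """Fahrenheit is an affine transformation of Celsius: F = 9/5 * C + 32."""
    return celsius * 9 / 5 + 32


def within_error(a, b, tol):
    return abs(a - b) <= tol


# (1) Trendler-style test: set the room to a known value on a Celsius thermostat,
# then read it on a (noise-free, for simplicity) Fahrenheit thermometer.
setpoint_c = 21.0
reading_f = c_to_f(setpoint_c)
raw_pass = within_error(setpoint_c, reading_f, tol=0.5)                # scales differ
converted_pass = within_error(c_to_f(setpoint_c), reading_f, tol=0.5)  # after conversion


def diff_ratio(a, b, c):
    """Ratio of temperature differences; invariant under affine rescaling."""
    return (c - a) / (b - a)


temps_c = (10.0, 20.0, 30.0)
ratio_c = diff_ratio(*temps_c)
ratio_f = diff_ratio(*map(c_to_f, temps_c))


# (2) Triangulation: compare two independent Celsius thermometers across rooms
# whose temperatures we take as we find them (never manipulated).
def triangulate(thermo_a, thermo_b, room_temps, tol):
    a = [thermo_a(t) for t in room_temps]
    b = [thermo_b(t) for t in room_temps]
    agree = all(within_error(x, y, tol) for x, y in zip(a, b))
    # Two stuck thermometers agree everywhere, so also require readings to vary across rooms.
    varies = statistics.pstdev(a) > tol and statistics.pstdev(b) > tol
    return agree and varies


rng = random.Random(1)
rooms = [rng.uniform(12.0, 28.0) for _ in range(20)]  # true temperatures, unknown to us


def noisy_thermometer(true_temp):
    return true_temp + rng.gauss(0.0, 0.1)  # assumed measurement error, SD 0.1 degrees


def stuck_thermometer(true_temp):
    return 20.0  # broken: always shows the same value


print(raw_pass, converted_pass, ratio_c, ratio_f)  # False True 2.0 2.0
print(triangulate(noisy_thermometer, noisy_thermometer, rooms, tol=1.0))  # True
print(triangulate(stuck_thermometer, stuck_thermometer, rooms, tol=1.0))  # False
```

The variation check is one simple way to act on the advice to repeat the comparison in rooms of varying temperature; a fuller treatment would model the error of each instrument explicitly.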

A general framework for evaluating arguments against measurement

Arguments against assumptions that practicing psychologists and psychometricians appear to take for granted in psychological measurement have a long and varied history (see, e.g., Bergson, 1913; Kyngdon, 2008; Michell, 1997, 1999, 2000; Reese, 1943; Stevens, 1946). No doubt, Trendler’s (2009) paper will not be the last in this line of descent. This section first provides a general framework for evaluating such arguments. This framework will be of assistance to authors in constructing and presenting such arguments, and also to reviewers and readers in general in evaluating them. It also has theoretical interest in and of itself as a means of formulating a modicum of meta-theory concerning such arguments. A very general strategy for arguing the impossibility of psychological measurement involves the following: (a) fix the intended sense of measurement, (b) identify a necessary condition for the intended sense of measurement, (c) show that something in the nature of psychological properties (which may or may not be magnitudes) precludes this necessary condition, and (d) conclude that measurement of psychological properties is not possible on this basis. The framework that follows assumes this basic argument structure and applies broadly to any argument adopting this general strategy. A successful argument should fulfill the following requirements:

1. The argument clearly specifies an intended notion of measurement.
2. The argument clearly specifies a necessary condition for the intended notion of measurement.
3. The condition is indeed necessary.
4. The condition is not necessary for things that the argument should not rule out.
5. The argument clearly specifies a quality of psychological properties that conflicts with the specified necessary condition.
6. The argument establishes that the conflicting quality is a universal quality of all psychological properties.
7. The argument establishes that the proposed quality universally precludes satisfaction of the necessary condition.
8. The argument avoids any equivocation on the above key elements as it moves from the necessary condition to the quality incompatible with the necessary condition, to the conclusion that measurement is not possible in psychology because all psychological properties share the quality incompatible with measurement.

For illustrative purposes, we apply these requirements to Trendler’s argument.

1. Trendler’s argument fares well in this respect by specifying the intended notion of measurement both axiomatically and verbally.
2. Trendler’s argument introduces an ambiguity between satisfying HHA and providing an empirical means to confirm satisfaction of HHA. The ambiguity in the necessary condition makes it hard to pin down the applicability of the necessary condition throughout the remainder of the argument.
3. If one takes HHA as the condition, then it is uncontroversially necessary. If one takes testability as the condition, this step becomes less certain. If one takes testability through the specified procedure as the condition, then it appears clearly unnecessary.
4. The HHA condition is necessary for a good deal that Trendler’s argument does not seek to rule out (names, categories, orders of psychological properties). The conclusion for the other two candidates for the necessary condition remains unclear.
5. Trendler’s argument fares well with respect to clearly specifying the ability to manipulate a property to a known or fixed value.
6. However, it provides little evidence in support of the conclusion that all psychological properties share this quality.
7. In addition, it does not fare well in showing that this is universally in conflict with the necessary condition.
8. The ambiguity in the necessary condition opens the door to equivocation between different interpretations at different stages of the argument, and therefore the argument is inconclusive.

Conclusion

Trendler’s analysis of the measurement problem in psychology is useful, because it reminds us that the term measurement is loaded with semantics from outside psychology (i.e., the natural sciences), and that the conditions required to use such semantics are neither trivial nor well confirmed in psychology. The current practice in psychology of using the term measurement for any observational procedure that produces numeric information may therefore import too many unjustifiable assumptions. In addition, Trendler has drawn attention to an important obstacle that blocks the systematic construction of measurement apparatuses by following the strategy that has been used so fruitfully in the natural sciences, namely that it is unlikely that psychological variables can be made to depend on a small number of manageable conditions. Even here, however, one should be careful to avoid overly sweeping generalizations: strategies analogous to those of physics are unlikely to succeed for, say, personality, but may be less problematic for basic tasks in psychonomics (e.g., two-choice response tasks, memory tasks). Still, taking note of these obstacles does not force one to subscribe to the thesis that measurement in psychology is impossible, because this strong conclusion does not follow from the premises of Trendler’s argument. In general, we think that the prospects for a knock-down argument against the possibility of measurement in psychology are not good. First, such an argument must rely on showing that some previously followed methodological scheme (i.e., the construction of measurement instruments in physics) is impossible in psychology. This in itself is difficult, as psychology is a large field in which numerous diverse research topics are studied; sweeping generalizations across all of these topics would appear inadvisable.
Second, such an argument will have to assume that future investigations must necessarily conform to the logic of measurement as it presumably applies to physics. It therefore discounts the possibility that scientists will, at some point, devise new ways of achieving their goals, which may or may not be structurally analogous to the cases studied by Trendler. This requires a strong reliance on the assumption that future developments in psychology are predictable, which is not a wise assumption to make (e.g., Popper, 1961). It also discounts the possibility that the scientific community will change its conceptual scheme so as to alter the meaning of the term measurement itself. Both these possibilities seem too realistic to discount on a priori grounds.

It is important to stress that we do not aim to defend the thesis that currently existing tests and other observational practices can be defensibly interpreted as measurement instruments in the strong sense of measurement intended by Trendler (2009) and Michell (1997, 1999). We are agnostic about this issue. On the one hand, a century of psychological testing has shown that psychological scales may display stable characteristics (e.g., the positive manifold of intelligence, reliable correlation patterns in personality tests). These characteristics suggest that the items and subscales included in such tests bear systematic connections to each other, which clearly indicates that something is going on in the data; that is, psychological scales are not random collections of items. A possible explanation for such facts is that the selected items and subscales causally depend on the same attribute (e.g., general intelligence, the Big Five personality factors), and if that is so, we think that standard psychometric techniques could justifiably subserve the interpretation of test scores as measures. On the other hand, it is in many cases far too early to decide whether the hypothesized dependence on a putative latent structure is justifiable, so the interpretation of test scores as measures depends heavily on hypotheses that are currently in doubt. In addition, the epistemic problems noted by Trendler (2009) render monumentally difficult the task of showing that psychological attributes are quantitative, measurable properties that are in fact picked up by psychological tests. Thus, even though the prospects for a knock-down argument against psychological measurement are limited, the prospects for the positive justification of measurement interpretations are limited as well.
More broadly, Trendler’s argument illustrates the challenges faced in producing a sound argument against psychological measurement. It adopts a general strategy for constructing such arguments that could easily provide a basis for a sustained program of research into this topic. We hope that this contribution to the literature does not get lost in the specific criticisms of the argument itself. We have outlined the basic structure of the strategy it follows and have provided a list of requirements that may serve the critical evaluation of such arguments, in order to encourage further work along these lines. Whether or not psychological measurement is possible, careful and detailed investigation into reasons to believe that it is not will almost certainly prove enlightening and advance the field.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Note

1. Indeed, one cannot merely drop this axiom in Hölder’s system, because his second axiom requires that “[f]or every magnitude there exists one that is less” (Michell & Ernst, 1996, p. 238). This entails that if magnitudes have a lower bound, then all magnitudes are greater than this lower bound, and the lower bound itself (e.g., zero) is not a possible magnitude.
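The entailment in Note 1 can be spelled out as a short derivation. We write M for the set of magnitudes, take < to be a strict (irreflexive, transitive) order on it, and read z ≤ a as "z < a or z = a"; this notation is ours, not Hölder’s or Michell and Ernst’s.

```latex
\begin{align*}
&\text{Axiom 2:}\quad \forall a \in M\ \exists b \in M:\ b < a.\\
&\text{Suppose some } z \in M \text{ were a lower bound: } z \le a \text{ for all } a \in M.\\
&\text{Axiom 2 applied to } z \text{ yields some } b \in M \text{ with } b < z.\\
&\text{Since } b \in M,\ z \le b;\ \text{so } b < z \le b \text{ gives } b < b,\ \text{contradicting irreflexivity.}\\
&\text{Hence any lower bound } z \text{ of } M \text{ lies outside } M,\ \text{and } z \le a,\ z \ne a \text{ give } z < a \text{ for all } a \in M.
\end{align*}
```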

Downloaded from tap.sagepub.com at UVA Universiteitsbibliotheek on November 23, 2011

14

Theory & Psychology

References

Bergson, H. (1913). Time and free will: An essay on the immediate data of consciousness. London, UK: G. Allen.
Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71, 425–440.
Cliff, N. (1992). Abstract measurement theory and the revolution that never happened. Psychological Science, 3, 186–190.
Dummett, M. (1993). The seas of language. New York, NY: Oxford University Press.
Hambleton, R.K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer Nijhoff.
Hölder, O. (1901). Die Axiome der Quantität und die Lehre vom Mass [The axioms of quantity and the theory of measurement]. Berichte über die Verhandlungen der Königlich Sächsischen Gesellschaft der Wissenschaften zu Leipzig, Mathematisch-Physische Classe, 53, 1–64.
Krantz, D.H., Luce, R.D., Suppes, P., & Tversky, A. (1971). Foundations of measurement (Vol. 1). Mineola, NY: Dover.
Kyngdon, A. (2008). The Rasch model from the perspective of the representational theory of measurement. Theory & Psychology, 18, 89–109.
Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison Wesley.
Michell, J. (1990). An introduction to the logic of psychological measurement. Hillsdale, NJ: Erlbaum.
Michell, J. (1997). Quantitative science and the definition of measurement in psychology. British Journal of Psychology, 88, 355–383.
Michell, J. (1999). Measurement in psychology. Cambridge, UK: Cambridge University Press.
Michell, J. (2000). Normal science, pathological science, and psychometrics. Theory & Psychology, 10, 639–667.
Michell, J., & Ernst, C. (1996). The axioms of quantity and the theory of measurement (part I). Journal of Mathematical Psychology, 40, 235–252.
Michell, J., & Ernst, C. (1997). The axioms of quantity and the theory of measurement (part II). Journal of Mathematical Psychology, 41, 345–356.
Popper, K.R. (1961). The poverty of historicism. New York, NY: Harper & Row.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Paedagogiske Institut.
Reese, T.W. (1943). The application of the theory of physical measurement to the measurement of psychological magnitudes, with three experimental examples. Psychological Monographs, 55, 6–20.
Root, M.M. (1989, July). Egyptian surveying tools. Backsights Magazine, 7.
Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677–680.
Trendler, G. (2009). Measurement theory, psychology and the revolution that cannot happen. Theory & Psychology, 19, 579–599.
van der Linden, W.J., & Hambleton, R.K. (1997). Handbook of modern item response theory. New York, NY: Springer.

Keith A. Markus is Professor of Psychology at John Jay College of Criminal Justice of The City University of New York. One line of his research focuses on test validity, including the role of causation and empirical bases for normative conclusions. Another focuses on the substantive interpretation of statistical models as expressions of theory, with special attention to causal interpretations. Address: Psychology Department, John Jay College of Criminal Justice, The City University of New York, 445 W. 59th Street, New York, NY 10019, USA.

Denny Borsboom is Associate Professor at the University of Amsterdam. His work has focused on conceptual analyses of psychometric models; topics include the theoretical status of latent variables, the concept of validity, the definition of measurement in psychology, and the relation between different test theoretic models. Address: Department of Psychology, Faculty of Social and Behavioral Sciences, University of Amsterdam, Roetersstraat 15, 1018 WB Amsterdam, The Netherlands.
