Psychological Assessment, 2005, Vol. 17, No. 4, 409–412
Copyright 2005 by the American Psychological Association. 1040-3590/05/$12.00 DOI: 10.1037/1040-3590.17.4.409

Improving Construct Validity: Cronbach, Meehl, and Neurath’s Ship

Drew Westen
Emory University

Robert Rosenthal
University of California, Riverside, and Harvard University

Smith’s article “On Construct Validity: Issues of Method and Measurement” is a fine tribute to L. J. Cronbach and P. E. Meehl (1955) that clarifies the current state and future directions in the understanding of construct validity. Construct validity is a dynamic process, and fit indices need to be used in the service of understanding, not in place of it. The failure of a study or set of studies to support a construct, a measure, or the theory underlying it admits of many explanations, and the ways scientists interpret such failures are prone to cognitive biases and motivated reasoning. This suggests why metrics designed to index the extent to which observations match expectations can be useful prostheses to scientific judgments. As P. E. Meehl (1954) showed decades ago, quantitative, statistical formulas and indices tend to outperform informal, qualitative judgments, and this applies as much to the way researchers evaluate constructs and measures as to judgments in the consulting room.

Keywords: construct validity, multitrait–multimethod matrix, fit indices, personality, assessment

Drew Westen, Department of Psychology and Department of Psychiatry and Behavioral Sciences, Emory University; Robert Rosenthal, Department of Psychology, University of California, Riverside, and Department of Psychology, Harvard University. Preparation of this article was supported in part by National Institute of Mental Health Grants MH62377 and MH62378 to Drew Westen. Correspondence concerning this article should be addressed to Drew Westen, Department of Psychology, Emory University, 532 Kilgo Circle, Atlanta, GA 30309. E-mail: [email protected]

Greg Smith’s article “On Construct Validity: Issues of Method and Measurement” is a superb tribute to Cronbach and Meehl (1955) on the 50th anniversary of their classic paper on construct validity, offering a clearly written, thoughtful retrospective and prospective account of the legacy of a work that is central to virtually everything we do in psychology. Like Cronbach and Meehl, he emphasizes that construct validation is a dynamic process, one that involves considerable bootstrapping and one that is never finished. Implicit in his description of construct validation, and of efforts to refine this construct itself, is the psychological tension we all experience between needing to use constructs and measures, and hence needing to treat them as relatively valid and reliable (in colloquial terms, “real”), while recognizing that they are always in flux and that our current analyses are always built to some extent on quicksand (or what Vaihinger called “convenient fictions”).

Of particular interest is Smith’s description of the myriad possible explanations for failures in prediction that may or may not ultimately threaten the validity of a construct (or a measure of that construct). The theory might be wrong, the measurement strategy might be wrong, the construct might be poorly specified, the item content might be problematic even if the construct is valid, the network of theories in which the construct is embedded might require qualification in particular domains (e.g., across cultures), or the auxiliary theories (or methods) required to test a crucial hypothesis might themselves be in error (and hence lead to the perception of falsification when it is the auxiliary theory, not the central theory under investigation, that is problematic). This is, of course, where the mischief can come in: When we know there is a problem, all we know is that the problem might lie in any of a number of places, giving us too many “degrees of freedom” to pick our favorite culprit. Kuhn (1962) and others rejected purely empiricist (justificationist) philosophies of science on the grounds, well documented in the history of science, that when results come in that are not to scientists’ liking, scientists are more likely to attack the method or the messenger than to consider their entrenched beliefs falsified. That is why, as imperfect as these procedures and metrics are, we develop techniques such as effect size estimates that can be meta-analyzed, fit indices that can be compared across studies, and so forth, to rein in the fancies of motivated minds.

In a common misunderstanding of the clinical–statistical prediction debate, psychologists often believe that what Meehl (1954) and others (e.g., Grove, Zald, Lebow, Snitz, & Nelson, 2000) have shown is that there is something peculiarly defective about the minds of clinicians. This reflects a widespread confusion between two meanings of clinical: one referring to the nature of the observer (clinicians vs. lay or other observers), which Meehl did not mean, and the other pertaining to the mode of aggregating observations (informal vs. quantitative), which Meehl did mean (see Westen & Weinberger, 2004). What Meehl argued, and what the data of the last 50 years support, is that experts of any sort—including researchers—who rely on subjective estimates and eyeballing of complex patterns will generally do worse than those who aggregate their data quantitatively. We would never, for example, recommend that anyone try to aggregate the vast literature on empirically supported therapies for depression intuitively, because of limitations imposed by both cognition (e.g., limits of working memory in performing multivariate statistics mentally) and motivation (desire for one or another outcome to be true). Meta-analysis is particularly useful in this regard because it constrains what we can believe, argue, and rationalize to ourselves and like-minded souls.
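To make the contrast between eyeballing and quantitative aggregation concrete, the sketch below shows one standard way of combining correlational effect sizes: a fixed-effect average of Fisher-z-transformed correlations weighted by n minus 3. It is a minimal illustration with invented study values, not a reproduction of any meta-analysis discussed here.

```python
import numpy as np

# Hypothetical validity coefficients (r) and sample sizes from five studies.
rs = np.array([0.45, 0.30, 0.52, 0.18, 0.38])
ns = np.array([80, 120, 60, 200, 95])

# Fisher z transformation stabilizes the sampling distribution of r;
# each z has approximate variance 1 / (n - 3), so weight each study by (n - 3).
zs = np.arctanh(rs)
weights = ns - 3.0

z_mean = np.sum(weights * zs) / np.sum(weights)  # weighted mean effect in z units
se = 1.0 / np.sqrt(np.sum(weights))              # standard error of the mean z
ci_z = (z_mean - 1.96 * se, z_mean + 1.96 * se)  # 95% confidence interval in z units

# Back-transform to the r metric for reporting.
r_mean = float(np.tanh(z_mean))
ci_r = tuple(float(np.tanh(z)) for z in ci_z)

print(f"Fixed-effect mean r = {r_mean:.2f}, 95% CI [{ci_r[0]:.2f}, {ci_r[1]:.2f}]")
```

The particular weighting scheme matters less than the fact that the combination rule is explicit and reproducible, and therefore indifferent to which result its user was hoping to obtain.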

That does not mean we would uncritically accept the results of a meta-analysis (particularly if it challenged our own overvalued position), but meta-analytic data are wonderful prostheses for, and constraints on, cognitively limited and motivationally directional minds. It is remarkable how scientists approvingly cite meta-analytic data until they conflict with their beliefs, at which point they start talking about the limits of meta-analysis as a procedure (see Wampold, 2001).

Smith’s description of three programs of research as models of advances in clinical assessment provides a useful set of case studies on the costs and benefits of quantifying goodness of fit in pursuit of construct validity. All three of these exemplary programs of research—on positive and negative affect as hierarchically higher-order affect/personality variables, on the five-factor model and related models of hierarchically organized personality traits, and on externalizing spectrum pathology—have relied extensively on fit indices of one sort or another to bolster their claims. Indeed, part of what has been so convincing about each of these approaches has been the use of confirmatory factor analysis and similar techniques (e.g., Procrustes rotations) to compare competing models (or, in the case of Krueger’s work, the use of Bayesian information criteria that can provide subtle comparisons of different kinds of models).

As the major contributors to each of these programs of research know, of course, fit indices are not foolproof, and as a field, we need to beware of replacing our previous idolatry of p values with an idolatry of fit indices (see Tomarken & Waller, 2003). It is quite possible for fit indices to support a hierarchical solution with a single higher-order factor even when the lower-order factors correlate .90 with one another, which is hardly a parsimonious solution from a conceptual standpoint. It is similarly possible that a six-factor model will prove a better fit than a five-factor model for a given data set, even though the sixth factor is neither conceptually coherent nor particularly informative, simply by virtue of the item content. Ultimately, we should be influenced by the numbers, not paint by them.

One of the most important limitations of all fit indices is that they cannot address whether the choice of items, indicators, observers, and so forth was adequate to the task. Thus, their meaning and use are always contingent on a broader purview of the relevant constructs, measures, and their history. For example, none of the three research programs Smith (appropriately) considers exemplary has yet convincingly addressed the problem of method variance attributable to observer effects, in large measure because the Zeitgeist (and the advantage of self-report questionnaires in generating the Ns required for the use of most fit statistics) has not led to the kind of critical examination in this regard that Smith describes as key to construct validation (and to science in general). To put it another way, these programs of research have taken Cronbach and Meehl (1955) seriously, but they have perhaps not taken Campbell and Fiske (1959) seriously enough.
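What taking Campbell and Fiske (1959) seriously would look like in practice can be made concrete with a toy multitrait–multimethod comparison. The sketch below is purely illustrative: the two traits, the two methods, and every value in the correlation matrix are invented, and the computation shown (comparing same-trait/different-method correlations with different-trait/same-method correlations) is only the simplest of Campbell and Fiske’s criteria.

```python
import numpy as np
import pandas as pd

# Hypothetical multitrait-multimethod correlations for two traits
# (neuroticism N, extraversion E), each measured by two methods
# (self-report and aggregated informant report). All values are invented.
labels = ["N_self", "E_self", "N_inf", "E_inf"]
R = pd.DataFrame(
    [[1.00, 0.46, 0.35, 0.10],
     [0.46, 1.00, 0.12, 0.41],
     [0.35, 0.12, 1.00, 0.42],
     [0.10, 0.41, 0.42, 1.00]],
    index=labels, columns=labels,
)

# Convergent validities: same trait, different method (monotrait-heteromethod).
convergent = [R.loc["N_self", "N_inf"], R.loc["E_self", "E_inf"]]

# Method effects: different traits, same method (heterotrait-monomethod).
same_method = [R.loc["N_self", "E_self"], R.loc["N_inf", "E_inf"]]

# Campbell and Fiske's criterion: convergent validities should comfortably
# exceed the same-method, different-trait correlations. In this invented
# matrix they do not, so method variance rivals trait variance.
print(f"mean convergent r:  {np.mean(convergent):.2f}")
print(f"mean same-method r: {np.mean(same_method):.2f}")
```

A full Campbell and Fiske analysis involves more comparisons than this, but even the toy version makes clear why fit indices computed within a single method cannot speak to the observer-effects problem raised above.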
(Krueger’s work in this respect is too recent to consider this a serious objection, although he will need to tackle the issue down the road, given the evidence that self-reports of externalizing pathology correlate only minimally with aggregated informant reports, which may be more predictive longitudinally of real-world criterion variables; Fiedler, Oltmanns, & Turkheimer, 2004.) Indeed, the vast majority of the studies supporting these three research programs have relied exclusively on self-reports, and where researchers have tested self-informant convergence, those correlations have tended to account for 5% to 25% of the variance (i.e., correlations of roughly .22 to .50). This would be a hefty chunk of the variance if we were correlating measures of two constructs that we believed to be related in some interesting way, but it suggests that trait variance is accounting for far less of the covariation between two purported measures of the same construct than some other, unstudied set of variables—despite (or perhaps reflected in) the high internal consistency of each measure taken separately. For example, self-reported neuroticism inherently confounds neuroticism, self-perception of neuroticism, willingness to admit neuroticism to oneself, and willingness to admit neuroticism on a questionnaire (self-presentation). These are four very different constructs. As Smith suggests, methods of parsing variance (in this case, method variance, or informant effects) derived from structural equation modeling could prove very useful in this respect if researchers were routinely to include aggregated informant data along with self-reports, which an emerging body of evidence suggests is essential in personality disorder research (Clifton, Turkheimer, & Oltmanns, 2005; Klonsky, Oltmanns, & Turkheimer, 2002). However, sample size requirements for most of these analyses render them difficult to implement.

Smith’s choice of negative exemplars (“not so good” measures and research programs) suggests why, despite the limits of current methods of quantifying construct validity, we need to keep trying to find ways of doing so. Rorschach indices (which have too often been equated with the Rorschach in general—a set of stimuli that can be neither reliable nor unreliable—or with the Exner (2003) scoring system in particular) have indeed been the subject of withering attacks over the last decade, some well justified, some less so. Unfortunately, many of these attacks seem a bit selective, focusing on particular Exner indices while ignoring, for example, work using Holzman’s thought disorder scoring system, which has impressive evidence of reliability, validity, and incremental validity (e.g., Coleman, Levy, Lenzenweger, & Holzman, 1996). Perhaps the most important research in this area is a study Smith does not cite, a nonpartisan meta-analysis comparing randomly selected Rorschach and Minnesota Multiphasic Personality Inventory (MMPI) indices, which found that the two instruments yield roughly equivalent effect sizes in predicting external criteria (Hiller, Rosenthal, Bornstein, Berry, & Brunell-Neuleib, 1999). Rorschach critics, of course, followed up immediately with post hoc critiques of the meta-analysis, which seemed to us far less convincing than what Smith describes as a Rorschach-friendly post hoc response, the claim that Rorschach responses measure implicit rather than explicit processes. (The “implicit measure” hypothesis seems to us odd to label as post hoc, given that the hypothesis that projective measures assess processes inaccessible to consciousness that are nevertheless expressed in perception, memory, and behavior was around long before current critics were born. Indeed, the whole point of projective tests from the start was to assess associative networks to which people may not have access; see Westen, Feit, & Zittel, 1999.)
Neither of us is a particular advocate of Rorschach measures, but we suspect that the availability of metrics for estimating construct validity might have given both Rorschach and MMPI researchers a better idea some time ago of whether they were achieving similar or different validity coefficients, on which scales, and using which kinds of criterion variables (e.g., self-reports vs. behavioral observation).

With respect to Smith’s discussion of the metrics we have proposed for such purposes, we appreciated his construction of just the kind of examples that help flesh out both the potential and the limits of such indices. As Smith suggests, and we would concur, fit indices are only useful when combined with a thoughtful examination of the hypotheses, constructs, and questions being asked. His examples are important in suggesting limits to blind application of our two metrics, just as blind application of contrast analysis can lead to misinterpretations in group comparisons, functional neuroimaging, and the other domains in which it is widely used. As Smith notes, we did not provide a range of normative examples using our metrics, which we hoped our article would stimulate, and Smith’s examples are useful in this regard. They seem to us to suggest both what these metrics can and cannot do in the presence or absence of just the kind of thoughtful, theory-grounded examination he recommends.

In Column 5 of Table 1, he presents an example of what would happen if our measure of histrionic personality disorder (HPD) of adolescence, which we believed to be a hybrid construct vis-à-vis the current diagnostic classification system but with histrionic features at its core, strongly correlated only with the Diagnostic and Statistical Manual of Mental Disorders (4th ed.; DSM–IV; American Psychiatric Association, 1994) histrionic diagnosis. The “alerting” coefficient is less useful with a sample of this size, for which the more precise estimates and confidence intervals of r_contrast-CV are likely to yield an accurate picture of the magnitude of the effect; hence, we will focus primarily on this metric. In this example, using his hypothetical data instead of our obtained data, r_contrast-CV drops precipitously, from .72 to .39 (or from accounting for roughly 50% of the variance to roughly 16%), which seems to us a good indication of the decrement in accuracy of the prediction.

The hypothetical example in the sixth column also seems to us to make good sense, if we bear in mind both the network of hypotheses embedded in the predictions and the magnitude of the obtained correlations. From the point of view of our hypothesis—that we have developed a reasonable measure of a revised construct of HPD of adolescence—the observed correlations would indeed be discouraging. The percentage of variance accounted for, reflected in r_contrast-CV, is less than one tenth of what we obtained when the observed correlations more closely matched our hypotheses, and the pattern of correlations would send us back to the drawing board. What the results reflected in the smaller but still significant r_contrast-CV suggest is that we do not have a valid measure of HPD of adolescence, but that our measure does have some ability to detect personality styles that we hypothesized to be unlike this construct. This might prove useful in helping us rethink what our measure might actually be assessing.

Smith’s final hypothetical example, in Column 7, is, at one level, the most challenging and, at another, a good illustration of why we should consider statistics of any sort (including fit statistics) as aids to understanding rather than as substitutes for it. Once again, if we had obtained the pattern of results depicted in Column 7, we would not have concluded that we had developed a valid measure of HPD of adolescence.
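The arithmetic running through these column-by-column comparisons, and the general logic of matching a predicted pattern of correlations to an observed one, can be sketched as follows. Everything in the sketch is invented for illustration: the contrast weights, the two sets of observed correlations, and the helper function name are ours, and the function computes only a simple pattern-matching correlation of the alerting type, not r_contrast-CV itself, whose computation also draws on information (such as the intercorrelations among the criterion measures) not represented here.

```python
import numpy as np

# Translation of a validity coefficient into variance accounted for (r squared),
# applied to the rounded coefficients discussed in the text.
for r in (0.72, 0.39):
    print(f"r = {r:.2f} -> about {100 * r ** 2:.0f}% of the variance")

# Simplified pattern-matching (alerting-type) correlation: how closely does an
# observed pattern of validity coefficients track the predicted pattern?
# Contrast weights and observed correlations below are hypothetical.
predicted = np.array([3.0, -1.0, -1.0, -1.0])        # high with the target construct, low elsewhere
observed_good = np.array([0.50, 0.05, 0.02, -0.04])  # differentiates the criteria as predicted
observed_poor = np.array([0.18, 0.15, 0.20, 0.12])   # little differentiation among the criteria

def pattern_r(weights, rs):
    """Correlate contrast weights with Fisher-z-transformed validity coefficients."""
    return float(np.corrcoef(weights, np.arctanh(rs))[0, 1])

print(f"pattern r, differentiated correlations:   {pattern_r(predicted, observed_good):.2f}")
print(f"pattern r, undifferentiated correlations: {pattern_r(predicted, observed_poor):.2f}")
```

With Smith’s Column 5 through Column 7 values substituted for the invented ones, the same computation conveys at a glance how sharply the match between prediction and observation deteriorates, which is the sense in which such metrics serve as aids to judgment rather than replacements for it.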
We might well, however, have concluded that we were on the way to developing a valid measure of a broader dimension shared by the DSM–IV Cluster B (dramatic, erratic) personality disorders and dependent personality disorder, such as unstable attachment relationships. This set of findings might, in fact, spur new thinking and research into what is shared by these disorders, which may well lead us to a more productive program of research and measurement.

What we did not emphasize enough, we suspect, in our initial presentation of the two metrics is that the goal of construct validation research (and of metrics designed to quantify goodness of fit between predictions and observations in a multitrait–multimethod matrix) is not just to satisfy us that we understand a given measure or construct well enough to use it but to help us continue to refine it. In this sense, we would be as critical as Smith of any attempt to use our proposed metrics to settle the question of the construct validity of any particular measure of any particular construct. As he so eloquently argues, questions of construct validity are never settled in any final sense, and in this, the enterprise of construct validation provides a marvelous model of the scientific enterprise in general. Otto Neurath’s (1921, pp. 75–76) nautical imagery for the work of scientists fits construct validation equally well:

    We are as sailors who are forced to rebuild their ship on the open sea, without ever being able to start fresh from the bottom up. Wherever a beam is taken away, immediately a new one must take its place, and while this is done, the rest of the ship is used as support. In this way, the ship may be completely rebuilt like new with the help of the old beams and driftwood—but only through gradual rebuilding.

The metrics we have proposed are only intended to serve as better beams than those provided by vague impressions of construct validity. Better beams contribute toward the improvement of the ship, but they cannot themselves complete the rebuilding. As Smith wisely reminds us, that task is never finished.

References

American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait–multimethod matrix. Psychological Bulletin, 56, 81–105.
Clifton, A., Turkheimer, E., & Oltmanns, T. F. (2005). Self- and peer perspectives on pathological personality traits and interpersonal problems. Psychological Assessment, 17, 123–131.
Coleman, M. J., Levy, D. L., Lenzenweger, M. F., & Holzman, P. S. (1996). Thought disorder, perceptual aberrations, and schizotypy. Journal of Abnormal Psychology, 105, 469–473.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Exner, J. E., Jr. (2003). The Rorschach: A comprehensive system (4th ed.). New York: Wiley.
Fiedler, E., Oltmanns, T., & Turkheimer, E. (2004). Traits associated with personality disorders and adjustment to military life: Predictive validity of self and peer reports. Military Medicine, 169, 32–40.
Grove, W. M., Zald, D. H., Lebow, B. S., Snitz, B. E., & Nelson, C. (2000). Clinical versus mechanical prediction: A meta-analysis. Psychological Assessment, 12, 19–30.
Hiller, J. B., Rosenthal, R., Bornstein, R. F., Berry, D. T. R., & Brunell-Neuleib, S. (1999). A comparative meta-analysis of Rorschach and MMPI validity. Psychological Assessment, 11, 278–296.
Klonsky, E. D., Oltmanns, T. F., & Turkheimer, E. (2002). Informant reports of personality disorder: Relation to self-reports and future research directions. Clinical Psychology: Science & Practice, 9, 300–311.
Kuhn, T. (1962). The structure of scientific revolutions. Chicago: University of Chicago Press.
Meehl, P. E. (1954). Clinical vs. statistical prediction. Minneapolis: University of Minnesota Press.
Neurath, O. (1921). Antispengler (T. Parzen, Trans.). Munich, Germany: Callwey.
Tomarken, A. J., & Waller, N. G. (2003). Potential problems with “well fitting” models. Journal of Abnormal Psychology, 112, 578–598.
Wampold, B. E. (2001). The great psychotherapy debate: Models, methods, and findings. Mahwah, NJ: Erlbaum.
Westen, D., Feit, A., & Zittel, C. (1999). Methodological issues in research using projective techniques. In P. C. Kendall, J. N. Butcher, & G. Holmbeck (Eds.), Handbook of research methods in clinical psychology (2nd ed., pp. 224–240). New York: Wiley.
Westen, D., & Weinberger, J. (2004). When clinical description becomes statistical prediction. American Psychologist, 59, 595–613.

Received January 11, 2005
Accepted April 19, 2005
