Running head: EXAM#2 STUDY GUIDE EPSY625


Exam#2 Study Guide EPSY 625, Spring 2017

Validity: Validity is the ability of a test to measure what it is supposed to measure; that is, whether the measure (test) fulfills the purpose for which it was designed. There are two purposes:
1. Measurement: locating an individual on a latent variable (attribute dimension).
2. Prediction: using test information to estimate an individual's status on an external variable.

Latent variables ("lying hidden," as opposed to observable variables) are variables that are not directly observed but are instead inferred (through a mathematical model) from other variables that are observed (directly measured).

Content validity

1. Is it possible that test scores could be highly reliable yet not valid in the construct sense? Why or why not?

Yes. A test may be measuring "something" in a reliable way, yet it is not valid in the construct sense unless that "something" is what the test was designed to measure.

2. What is content validity? What is the typical procedure to evaluate this type of validity?

Content validity is whether the specific content of a test adequately represents the entire content domain; it is the validity of a specific set of items with respect to that domain. Typical procedure for evaluating content validity:
1. Define the domain: produce a detailed set of content specifications that clearly define the content domain, to serve as a guide for judging content validity.
2. Evaluate the test items against the content specifications: create test items (or evaluate existing items) with respect to the specifications. Expert judgment is used here, as there are no quantitative indices for content validity.
3. Make the validity judgment in two ways: inclusiveness, and freedom from irrelevant content.

3. What do inclusiveness and freedom from irrelevant content mean in evaluating content validity? Give one example for each concept.

Inclusiveness refers to the coverage of the test's content with respect to the entire domain: does the test adequately cover the full range of content in the domain? Violating inclusiveness means some areas of content are not included in the test; for example, a science achievement test that omits all biology items violates inclusiveness for a domain that includes biology.


Freedom from irrelevant content describes whether a test excludes content that is truly irrelevant to the domain. Content validity is compromised if a test includes items that are not part of the intended content domain; for example, a reading comprehension test that includes pure arithmetic items includes irrelevant content. These issues are only examined during construction of the test.

Construct validity

Construct validity tests whether a test adequately measures the variable it is intended to measure: "the degree to which a test measures what it claims."
1. Determines someone's relative position on a hypothetical construct.
2. No expert judgment is used.
3. Empirical evidence is gathered to support or refute (counter) the construct interpretation of the test.

4. What are the internal and external approaches to construct validation?

Internal approach:
1. Evaluates the consistency of correlations between the internal items of the test (local independence).
2. Construct a hypothesis about what the test measures and how many latent variables (dimensions) might underlie the test items.
3. Evaluate whether the items relate to each other the way hypothesized.

External approach:
1. Evaluates the consistency of correlations between our measure (a test) and external measures (variables).
2. Construct a hypothesis for our test and the external variables.
3. Evaluate whether the items relate to the external variables the way hypothesized.
4. Collect empirical evidence (convergent or divergent) to support or refute the hypothesis about how the measure should relate to "other variables" if the test is construct valid.

Local independence (used to determine the number of latent variables): no relationship exists between item scores after taking the latent variables into account.

Dimensionality (of an item set): defined as the smallest value of r (the number of hypothesized latent variables) such that local independence holds.

Factor analysis (FA) is the tool for testing construct validity by confirming a hypothesis for r (the number of latent variables) and for the relation between the items and the factors (it addresses the dimensionality issue).
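Local independence can be illustrated numerically. The following is a minimal numpy sketch, not part of the course materials; the loadings, noise levels, and sample size are arbitrary assumptions. Two items that share one latent variable W correlate marginally, but the correlation vanishes once W is partialled out:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Latent variable W drives two item scores; their only link is W.
W = rng.normal(size=n)
x1 = 0.8 * W + rng.normal(scale=0.6, size=n)
x2 = 0.7 * W + rng.normal(scale=0.7, size=n)

# Marginally the items correlate because they share W...
marginal_r = np.corrcoef(x1, x2)[0, 1]

# ...but after partialling W out (regress each item on W, then
# correlate the residuals), the relationship vanishes: that is
# local independence.
def residualize(x, w):
    slope = np.cov(x, w)[0, 1] / np.var(w)
    return x - slope * w

partial_r = np.corrcoef(residualize(x1, W), residualize(x2, W))[0, 1]
```

With one latent variable sufficing to remove all inter-item correlation, the item set is unidimensional (r = 1).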


5. What are the two primary goals of factor analysis (the common factor model) as an internal approach in evaluating construct validity?
1. To determine the common factors.
2. To determine the relation between the factors and the test items (which variable goes with which factor).

6. Describe the common factor model analogous to a linear regression model.

The common factor model (CFM) defines the latent variables that underlie a set of items as "common factors." Common factors are unmeasured. The model is analogous to a linear regression model, with the measured variables as criteria and the common factors as predictors. The observed (measured) score for the ith person on the jth variable is written as:

X_ij = tau_j + lambda_j * W_i + u_ij

where
- tau_j = intercept (not important in the CFM),
- lambda_j = factor loading (the correlation between W and X),
- W_i = common factor (the variance shared across the measured variables),
- u_ij = unique factor (the variance unique to each measured variable, not explained by the common factor).

Additional assumptions:
1) Common and unique factors do not correlate: Cov(W_i, u_ij) = 0.
2) Unique factors do not correlate: Cov(u_ij, u_im) = 0. This assumption is dropped for longitudinal data.

Communality: the proportion of variance in a measured variable that is due to the common factor. A high factor loading means high communality; communality is the common factor's share of the total variance.

Uniqueness: the proportion of variance in a measured variable that is due to the unique factor; the counterpart of communality. High uniqueness means the variable shares little variance with the other variables.
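The variance decomposition into communality and uniqueness can be checked by simulation. This is a minimal numpy sketch with arbitrary loadings, assuming standardized variables and a standardized factor (so the loading equals the correlation with W and the communality is the squared loading):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50000

# One common factor W (variance 1) and standardized measured variables:
# X_j = lambda_j * W + u_j, with Var(u_j) = 1 - lambda_j**2.
loadings = np.array([0.9, 0.7, 0.5])
W = rng.normal(size=n)
X = np.empty((n, loadings.size))
for j, lam in enumerate(loadings):
    unique = rng.normal(scale=np.sqrt(1 - lam**2), size=n)
    X[:, j] = lam * W + unique

# Communality = lambda_j**2 (variance due to the common factor);
# uniqueness = 1 - communality. Total variance of each X_j is 1.
communalities = loadings**2
uniquenesses = 1 - communalities
empirical_var = X.var(axis=0)  # should be close to 1 for each variable
```

The variable with loading 0.9 has communality 0.81 (high), while the one with loading 0.5 has communality 0.25 and uniqueness 0.75, matching the statement that high uniqueness means little shared variance.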


7. What are the differences between EFA and CFA? How is the number of factors determined in EFA and CFA?

Exploratory Factor Analysis (EFA):
1. Explores the variables to see which item goes with which factor.
2. Factor loadings are not uniquely defined.
3. Rotations are performed to extract factors that improve interpretation (an orthogonal solution is hard to interpret).
4. Explores the best-fitting model by varying the number of factors and the rotation.

Confirmatory Factor Analysis (CFA), especially useful when multiple factors (dimensions) are hypothesized:
1. The factors are chosen using a hypothesized model.
2. The factor loadings to be estimated are specified according to the hypothesized model (the others are not estimated).
3. No rotations are performed.
4. The fit of the proposed model is tested against the data.

8. What are the implications of orthogonal and oblique rotations in EFA with regard to factors?
(i) Orthogonal rotation: preserves uncorrelated factors, but changes the loadings.
(ii) Oblique rotation: allows the common factors to be correlated.

9. What are the advantages of CFA over EFA?

Unlike EFA, CFA can directly represent a hypothesized factor model using parameter constraints, leading to tests of fit for this structure. CFA is especially useful if multiple factors are needed. In addition, CFA is a more hypothesis-driven, research-based approach.

10. Describe the process of collecting convergent and divergent evidence in the external approach to construct validation.

Convergent evidence:
1. Empirical evidence is gathered to confirm the hypothesis.
2. Pose a question: if our measure is valid, how should it relate to the external variables A, B, C, etc.?
3. List all developed implications.
4. Collect data to verify the predicted relations. (The richer the set of predictions, the better.)
For example, a valid measure of reading comprehension should relate positively to any other existing reading comprehension measure and to grades in coursework that requires reading.

Divergent (discriminant) evidence:
1. Empirical evidence is gathered to refute (disprove) alternative explanations.
2. Look for alternative explanations of what the test might be measuring other than the target construct.
3. List the implications that would hold if the test were not measuring the alternative construct.
4. Refute each alternative explanation if our original construct interpretation is correct.
For the reading comprehension example, we can develop several alternative explanations for what the test might be measuring: 1) general knowledge, and 2) test-taking skills. We would then design studies to examine each alternative explanation: for 1), compare two groups of students taking the test with and without the reading passages; for 2), train some examinees in test-taking skills and compare their scores with the others'.

The goal is to discriminate the measure and its construct interpretation from competing explanations.

Method effects:
1. The correlation between measures of the same construct is reduced when different methods are used.
2. The correlation between measures of different constructs is inflated (exaggerated) when the same method is used.

The multitrait-multimethod (MTMM) strategy examines both method effects simultaneously.

11. What is the multitrait-multimethod (MTMM) strategy proposed by Campbell & Fiske (1959)? What is this strategy for? What are Campbell and Fiske's rules for convergent and divergent validities in the MTMM matrix?


In MTMM, a set of constructs (traits T1, T2, T3) is measured using a set of methods (M1, M2, M3), and the scores are assembled in a correlation matrix, organized into blocks by trait and method, known as the multitrait-multimethod correlation matrix.

Rules for convergent and discriminant evidence (based on the correlation patterns):
1) Reliabilities across methods should be reasonably high and similar (otherwise, different levels of attenuation will arise).
2) Convergent validities (m|h = monotrait-heteromethod correlations) should be the highest in the MTMM matrix, assuming the methods are independent (uncorrelated). That is, a trait should correlate more highly with itself across different methods than with any other trait.
3) Divergent validities: heterotrait-monomethod correlations should not be higher than the heterotrait-heteromethod (h|h) correlations; otherwise, common method effects are inflating the monomethod correlations.
4) The pattern of correlations among the traits should be approximately the same in all blocks.

12. Describe CFA approaches to examine method effects in MTMM data. Include a description of the "traits-only" model and the traits-and-methods model.

An advantage of CFA for MTMM is that the MTMM structure can be tested directly for fit. Models are tested sequentially, starting with the most parsimonious model, which specifies traits only. In a CFA model that includes both trait (T) and method (M) factors, we estimate the proportion of variance in each measure that is due to trait, to method, or to the unique factor.
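The Campbell and Fiske rules above can be sketched as a simple check on a toy MTMM matrix. The correlations below are entirely hypothetical (2 traits by 2 methods), chosen only to illustrate the comparison:

```python
import numpy as np

# Hypothetical 2-trait x 2-method correlation matrix, variable order
# (T1M1, T2M1, T1M2, T2M2). The numbers are illustrative only.
R = np.array([
    [1.00, 0.30, 0.60, 0.15],
    [0.30, 1.00, 0.20, 0.55],
    [0.60, 0.20, 1.00, 0.25],
    [0.15, 0.55, 0.25, 1.00],
])

convergent = [R[0, 2], R[1, 3]]        # monotrait-heteromethod (m|h)
het_hetmethod = [R[0, 3], R[1, 2]]     # heterotrait-heteromethod (h|h)
het_monomethod = [R[0, 1], R[2, 3]]    # heterotrait-monomethod

# Convergent validities should exceed both the heterotrait-heteromethod
# and the heterotrait-monomethod correlations.
passes = min(convergent) > max(het_hetmethod + het_monomethod)
```

Here the convergent validities (.60, .55) dominate all heterotrait correlations, so this toy matrix satisfies the rules; if the monomethod correlations (.30, .25) were larger than the convergent values, method effects would be suspected.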


That is, we can fit:
a) a CFA traits-only model, and
b) a CFA traits-and-methods model.
These models are often unstable and can give improper parameter estimates; large samples are needed for them.

Criterion-related (CR) validity

13. What is criterion-related (CR) validity and how can it be evaluated?

CR validity refers to the strength of association between the measure being validated and an external criterion measure. We infer someone's standing on a given criterion using the information provided by the test (e.g., predicting academic achievement, job performance, or consumer buying behavior). Typical approach to assessing CR validity: 1. Collect data on the test as well as on the criterion. 2. Use the data to quantify the strength of association. The usual correlation coefficient (r_xy) is used, called the CR validity coefficient.

14. What are the problems you might encounter when studying CR validity?

1. Measurement error: random measurement error in the criterion attenuates the correlation between the test and the criterion. To correct for attenuation, we estimate the correlation between the true scores of the test (x) and the criterion (y):

r_TxTy = r_xy / sqrt(r_xx * r_yy)

where r_xx and r_yy are the reliabilities of the test and the criterion. The correction can be applied to the observed correlation for the test, for the criterion, or for both. For perfectly reliable measures, r_xy = r_TxTy. Unreliability is a common issue with criterion measures, and we need good reliability estimates to get accurate disattenuated correlations.

2. Unrepresentative and range-restricted samples: range restriction, caused by selection on the test or by other selection procedures, attenuates the validity coefficient. (For example, if we do not admit students below an SAT score of 2000, we do not observe the full range of GPAs in the sample.)

3. Small validation sample sizes: when the validation sample is small (N < 50), sampling error in estimating the population correlation can make the validity look higher or lower in the sample. The same test might show varying validity across samples due to sampling error alone.
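The correction for attenuation can be written as a one-line helper. The observed correlation and reliabilities below are hypothetical values for illustration:

```python
import math

def disattenuate(r_xy, rel_x=1.0, rel_y=1.0):
    """Correction for attenuation: estimated correlation between true
    scores, given the observed correlation and the reliabilities.
    Pass rel_* = 1.0 to leave that measure uncorrected."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Observed test-criterion correlation .42, reliabilities .90 and .70:
r_true = disattenuate(0.42, rel_x=0.90, rel_y=0.70)
```

Correcting for the criterion's unreliability raises the estimated true-score correlation, which is why good reliability estimates for the criterion matter so much.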


There are two types of prediction accuracy (how accurate is our prediction?):
(i) Metric accuracy: the closeness of the predicted value to the actual value.
(ii) Decision accuracy: the correctness of a decision based on the predicted score or test score.

15. What are the 3 measures of decision accuracy? Define each index of measuring decision accuracy.

Decision accuracy tells whether a decision based on a prediction from a test is accurate.

1. Success ratio (positive predictive value) = TP / (TP + FP). It is positively related to the criterion-related validity r_xy. Minimum standard for decision accuracy: success ratio > base rate, where base rate = (TP + FN) / (TP + TN + FP + FN), the proportion of people who succeed.
2. Sensitivity = TP / (TP + FN): the proportion of those who would succeed on Y whom the test correctly predicts as "successes." With high sensitivity, anyone who would succeed on Y almost always gets a positive test result.
3. Specificity = TN / (TN + FP): the proportion of those who would fail on Y whom the test correctly predicts as "failures." With high specificity, anyone who would fail on Y almost always gets a negative test result.

16. What is the dual paradox in decision accuracy? What can we do about it?

Sensitivity and specificity change with the cut-off value: raising the cut-off on the test (X) decreases sensitivity and increases specificity, while lowering the cut-off increases sensitivity and decreases specificity. We can prioritize one index over the other depending on the type of test: sensitivity is more important in screening tests (e.g., for autism), and specificity is more important in tests that confirm the results of a screening test.
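The three indices can be computed directly from the four cells of a decision table. The counts below are hypothetical, purely for illustration:

```python
# Counts from a hypothetical 2x2 decision table:
# predicted success vs. actual success on the criterion Y.
TP, FP, FN, TN = 40, 10, 15, 35

success_ratio = TP / (TP + FP)               # positive predictive value
base_rate = (TP + FN) / (TP + FP + FN + TN)  # proportion who succeed
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)

# Minimum standard: selecting by the test should beat selecting
# at random, i.e. success ratio > base rate.
useful = success_ratio > base_rate
```

With these counts the success ratio (.80) clearly exceeds the base rate (.55), so the test meets the minimum standard; raising the cut-off would trade TP for TN, lowering sensitivity while raising specificity, which is the dual paradox in numbers.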


Biases in Measurement and Prediction

17. What is measurement invariance? Give an example of violation of measurement invariance.

No measurement bias ⇔ measurement invariance. Mathematically, measurement invariance for X holds if and only if P(X | W, G) = P(X | W) for all (X, W, G), where X is the observed score on a test item, W is the latent variable, and G is a group membership indicator. It means that two individuals from different groups who have the same score on W should have the same chance of getting a particular score on X.

Example of a violation of measurement invariance: within a group of men and women who are identical in science achievement (W), men get higher scores on the science achievement test (X) than women.

The definition does not say that the distribution of scores on X will be the same across groups: those distributions can differ even without bias, provided that the distributions of W differ across groups. It says that if X is invariant, any systematic group differences in scores on X are due to group differences on the latent variable W, not to bias. The implication is that we cannot decide whether bias is present simply by comparing the overall distributions of observed scores across groups.

18. Describe CFA approaches to examine measurement invariance.

19. What is predictive invariance? Give an example of violation of predictive invariance.

No predictive bias ⇔ predictive invariance. Mathematically, predictive invariance for Y holds if and only if P(Y | X, G) = P(Y | X) for all (X, Y, G), where Y is the criterion score, X is the test score, and G is a group membership indicator. It means that the predicted scores on Y for two individuals with the same X score should be the same regardless of their groups.

Example of a violation of predictive invariance: we use the same regression of SAT scores on average achievement scores for students from two different classes, but the predictions run much higher for class one than for class two.

20. What is the typical approach to examine predictive invariance?

A linear regression approach is used when the test scores (X) and the criterion measure (Y) are continuous. In linear regression, a group difference in the regression lines is evidence of predictive bias. The null hypothesis is H0: the groups have identical regression slopes and intercepts.
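The regression approach can be sketched by fitting a separate line per group on simulated data. All parameter values below are arbitrary; the simulation deliberately builds in a group difference in intercepts, which is exactly the kind of evidence of predictive bias the comparison looks for:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Simulate a violation of predictive invariance: the groups share a
# slope for the X -> Y regression, but their intercepts differ.
X = rng.normal(size=n)
group = rng.integers(0, 2, size=n)  # 0/1 group membership indicator G
Y = 0.5 * X + 0.4 * group + rng.normal(scale=0.8, size=n)

# Fit a separate regression line per group and compare coefficients.
slope0, intercept0 = np.polyfit(X[group == 0], Y[group == 0], 1)
slope1, intercept1 = np.polyfit(X[group == 1], Y[group == 1], 1)
```

In practice the comparison is done with formal tests of H0 (equal slopes and intercepts), e.g. by adding the group dummy and its interaction with X to a single regression; the per-group fit here is only meant to make the idea visible.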


IRT

IRT models express the probability of achieving a given item score as a function of characteristics of the person and of the item. In IRT, a person's trait level is estimated from both the person's responses to the items and the properties of those items. The main idea in IRT models: the distance between a person's location and the item's location predicts the probability of passing the item.

1. What are the limitations of CTT? How does IRT address each of the CTT problems?

Limitations of CTT (which takes the whole test, a composite, into account):
1. CTT is test oriented; it gives no information on the individual test items.
2. CTT is based on the true scores of test forms and their scoring procedures (norm referenced). CTT uses a fixed set of items: changing the set of items changes the true score, even when the items measure the same thing.
3. In CTT, item statistics are a function of both the items and the sample (the people who took the test).
4. CTT does not model the standard error of measurement as varying across people; the SEM is assumed to be identical for all individuals, meaning the same test is equally reliable for everyone.

How IRT addresses each problem (item focused):
1. IRT is item oriented; it tells how an examinee responded to each item. Item-level analysis allows us to select the items that are most useful for the test, and precise positioning of items permits greater accuracy with shorter tests.
2. IRT is based on latent variables (ability or trait scales) that are independent of the particular set of items. IRT therefore allows adaptive testing (different examinees get different sets of items).
3. In IRT, the item parameters (item difficulty and item discrimination) are separate from the latent variable scale and need not depend on the sample; results are a function of the item instead of the sample. If the items fit a common IRT model, the item parameters can be estimated with different sets of examinees, all leading to similar estimates.
4. In IRT, "reliability" is replaced by the item information function, which differs across people (more realistic). The "standard error of estimation," analogous to the standard error of measurement in CTT, is inversely related to the information function.

2. What is the item response function (IRF)? What is the most important element that determines the item response function across all IRT models for ordinal items?

The IRF is the function that gives the probability of obtaining a particular score on a given item as a function of the latent variable (say, a 90% chance of endorsing a depression scale item). For cognitive (ability) items, the IRF gives the probability of passing the item; for non-cognitive (attitude) items, it gives the probability of agreeing with the item. The most important element is the distance between the person's location and the location of the item (b), where the item location is the difficulty parameter.

3. How is a difficulty parameter defined in IRT? What is the interpretation of the difficulty parameter?

The difficulty parameter (b) is a measure of how difficult the item is. For example, in cognitive tests, a high value of b means that only people at the high end of the latent variable distribution are likely to pass the item. The Rasch model is highly restrictive: its items vary in difficulty but do not differ in any other way (no discrimination parameter), and the model often fails to fit real data.

4. How is a discrimination parameter related to the IRF? What is the interpretation of the discrimination parameter?

The discrimination (or slope) parameter is related to the steepness of the item response function: it is the rate of change of the function with respect to the person's latent variable score. A high value (e.g., 2) means the item characteristic curve is very steep. In other words, a difference on the latent variable between individuals leads to a big difference in their probabilities of passing (endorsing) the item: the item is highly discriminating.
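The difficulty and discrimination parameters can be made concrete with the two-parameter logistic (2PL) model, one common IRT model (written here in pure logistic form, without the 1.7 scaling constant some texts use). The parameter values are illustrative:

```python
import math

def irf_2pl(theta, a, b):
    """2PL item response function: probability of passing an item with
    discrimination a and difficulty b for a person at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information_2pl(theta, a, b):
    """2PL item information, a**2 * P * (1 - P); it peaks at theta = b,
    where the IRF is steepest."""
    p = irf_2pl(theta, a, b)
    return a**2 * p * (1 - p)

# A person located exactly at the item's difficulty passes with P = .5,
# and that is where the item provides the most information.
p_at_b = irf_2pl(theta=0.0, a=2.0, b=0.0)
```

Raising a makes the curve steeper (the item discriminates more sharply), while raising b shifts the curve to the right (only high-theta examinees are likely to pass), matching the interpretations above.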


A high discrimination value means the item can reliably distinguish individuals who differ on the latent variable.

5. What is the item information function? What is the test information function?

The item information function describes how much information about examinees an item provides at each point on the latent variable scale. In IRT, it replaces the classical notion of reliability from CTT. An item does not provide an equal amount of information about all examinees; item information depends on the region of the latent variable scale. Where the item response function rises quickly, the item provides a lot of information about examinees in that region; where the item response function is nearly flat, the item provides little information about examinees in that region.
• The steep part of the item response function corresponds to the region with the highest information value.
• Once the item information functions for all items in a test are available, the test information function can be created as the sum of the item information functions.

6. What are the major areas of IRT application?

1. Test construction: IRT allows us to perform item analysis and then select items to yield a test with a desired information function. It can also be used to construct multiple forms of a test that are equivalent in their measurement properties, assuming a large enough item pool.
2. Test equating: putting multiple forms on a common metric is a common problem. We fit the items in each form to a common IRT model, using either a sample that has taken both forms or a common linking subset of items; IRT then provides the required equating.
3. Computerized adaptive testing (CAT): CAT uses an individual's responses to previous items to select the next item, so different examinees may end up taking different forms of the test. CAT as currently used requires a large pool of potential items with available item parameter estimates and item information functions.
4. Studying measurement bias: IRT is particularly useful for studying measurement bias at the item level.


Measurement invariance holds for an unbiased item when its item response function is the same across groups; so under no bias, the item parameter estimates should be the same across populations.

7. What is DIF (Differential Item Functioning)?

A test item shows DIF when people with equal ability, but from different groups, have an unequal probability of success on the item. Measurement is said to be biased if differential item functioning (DIF) exists for the groups being compared.
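DIF can be illustrated with two 2PL item response functions that share a discrimination parameter but differ in difficulty across groups (uniform DIF). All parameter values are hypothetical:

```python
import math

def irf_2pl(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Uniform DIF: the item is harder for the focal group (b = 0.5) than
# for the reference group (b = 0.0), so two people with the SAME
# ability (theta) have unequal probabilities of item success.
theta = 0.0
p_reference = irf_2pl(theta, a=1.2, b=0.0)
p_focal = irf_2pl(theta, a=1.2, b=0.5)
dif_gap = p_reference - p_focal
```

If the two groups' item parameters were equal, the curves would coincide and dif_gap would be zero at every theta; a nonzero gap at equal ability is precisely what flags the item for DIF.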
