Journal of Educational Psychology 2007, Vol. 99, No. 4, 775–790

Copyright 2007 by the American Psychological Association 0022-0663/07/$12.00 DOI: 10.1037/0022-0663.99.4.775

Do University Teachers Become More Effective With Experience? A Multilevel Growth Model of Students’ Evaluations of Teaching Over 13 Years

Herbert W. Marsh
Oxford University

Do university teachers, like good wine, improve with age? The purpose of this methodological/substantive study is to apply a multilevel growth modeling approach to the long-term stability of students’ evaluations of teaching effectiveness (SETs). For a diverse cohort of 195 teachers who were evaluated continuously over 13 years (6,024 classes, an average of 30.9 classes per teacher), there was little evidence that teachers became either more or less effective with added experience. This stability of SETs generalized reasonably well over undergraduate- and graduate-level courses, early career teachers, and teachers who differed substantially in terms of their overall teaching effectiveness. Whereas there were substantial individual differences between teachers in terms of their teaching effectiveness, these individual differences were also highly stable over time. Although highly supportive of the use of SETs for many purposes, the results provide a serious challenge for existing programs that assume that SET feedback alone is sufficient to improve teaching effectiveness.

Keywords: students’ evaluations of teaching effectiveness, long-term stability and change, multilevel growth modeling

Author note: I would like to acknowledge helpful comments on drafts of this article by Harvey Goldstein, Jon Rasbash, and Alison O’Marra. Correspondence concerning this article should be addressed to Herbert W. Marsh, Department of Education, University of Oxford, 15 Norham Gardens, Oxford OX2 6PY, UK. E-mail: [email protected]

Students’ evaluations of teaching effectiveness (SETs) have been the topic of considerable interest and a great deal of research in North America and, increasingly, in universities all over the world. SET research has been motivated by the traditional importance of teaching in universities, an increasing emphasis on monitoring the quality of university teaching, and a focus on teaching and learning outcomes for quality assurance exercises in higher education (Marsh, 2007). SETs are widely endorsed by teachers, students, and administrators and have stimulated much research spanning nearly a century. Numerous studies have shown SETs to be valid in relation to a variety of outcome measures broadly accepted by classroom teachers (e.g., learning inferred from classroom and standardized tests, student motivation, plans to pursue and apply the subject, positive affect, experimental manipulations of specific components of teaching, ratings by former students, classroom observations by trained external observers, and even teacher self-evaluations of their own teaching effectiveness; see Marsh, 1987, 2007; Marsh & Roche, 1997). The focus of the present investigation is on the long-term stability of ratings of the same teachers during the course of their teaching career. Substantively, the study evaluates the stability of university teaching effectiveness on the basis of SETs for a cohort of teachers who were evaluated continuously over a 13-year period to answer the question of whether teaching effectiveness increases, decreases, or remains stable with added experience.

Methodologically, I demonstrate a multilevel approach to growth modeling that is ideally suited to this substantive issue but has not previously been applied in this area of research. This methodological approach provides a basis for combining, in a single analytic approach, questions of whether teaching effectiveness averaged across this group of teachers increased, decreased, or remained stable over time (mean stability); how well these group-level results generalize across individual teachers; the extent of agreement in ratings of the same teacher from one year to the next (covariance stability); and how characteristics of the teacher and the courses they teach influence mean and covariance stability. There are many approaches to the study of stability and change (Collins & Sayer, 2001; Duncan, Duncan, & Strycker, 2006; Goldstein, 2003; Nesselroade & Baltes, 1979; Plewis, 1985; Rogosa, Brandt, & Zimowski, 1982; Rogosa, Floden, & Willett, 1984; Willett, 1988; Willett & Sayer, 1994). The two most common, however, refer to the stability of means over time (mean stability) and to the stability of individual differences over time (covariance stability). In each case it is desirable to have longitudinal data in which the same teachers are evaluated on many different occasions over an extended period of time. This is particularly relevant because more recent multilevel growth modeling approaches provide exciting new ways to combine the evaluation of both mean and covariance stability within a single analytic framework. A number of researchers (e.g., Bausell & Bausell, 1979; Gillmore, Kane, & Naccarato, 1978; Kane, Gillmore, & Crooks, 1976; Kulik & McKeachie, 1975; Marsh, 1981, 1987, 2007; Rindermann & Schofield, 2001) have examined correlations between ratings of the same instructor in different offerings of the same course, the same teacher in different courses, and different teachers teaching the same course in an attempt to disentangle the relative influence of the course and the teacher.
Using a path analytic approach, Marsh (1981) found that SETs were primarily a function of the teacher who teaches a course rather than the course that is taught. For overall ratings of the instructor and of the course, the correlations between ratings of different instructors teaching the same course (one estimate of the course effect) were −.05 and −.01, respectively, whereas correlations between ratings for the same instructor in two different classes (.61 and .59) and in two different offerings of the same course (.72 and .71) were much larger. Based on these findings, Marsh (1987, 2007) concluded that SETs are primarily a function of the teacher who teaches a course rather than the course that is being taught. These studies of covariance stability support the practice of aggregating ratings across different courses and suggest that individual differences in teaching effectiveness are stable over time, but they do not address the issue of mean stability.

Overall and Marsh (1980; Marsh & Overall, 1979) examined stability in a longitudinal study in which the same students evaluated teachers at the end of the course and retrospectively, several years after finishing the course and at least 1 year after graduation from the program. They showed that mean ratings were nearly the same at both points in time and that class-average responses for the end-of-term ratings correlated .83 with the retrospective ratings. There was little systematic difference between the mean ratings collected at the end of the term and the retrospective ratings collected several years later. This study demonstrates that the perspectives of the same students do not change over time and counters the claim that students would evaluate instructors differently after being called upon to apply course materials in further coursework or after graduation. These results, however, address the mean and covariance stability of responses by the same students about teaching effectiveness for a given course and not the stability of teaching effectiveness in different courses over an extended period of time.

Sadly, there is a broad range of longitudinal and particularly of cross-sectional research demonstrating that without systematic intervention, teaching effectiveness—at all levels, no matter how measured—tends to decline with age and years of teaching experience. Based on cross-sectional studies at the primary and secondary school level that used a wide variety of indicators of teaching effectiveness, Ryans (1960; also see Barnes, 1985) reported an overall negative relation between teaching experience and teaching effectiveness. Ryans suggested, however, that there was an initial increase in effectiveness during the first few years, a leveling-out period, and then a period of gradual decline. In her review of subsequent research since the early 1960s, Barnes (1985) reached a similar conclusion. She further reported that teaching experience beyond the first few years was associated with a tendency for teachers to reject innovations and changes in educational policy.

At the university level, Feldman (1983; also see Blackburn & Lawrence, 1986; Feldman, 1977, 1997; Horner, Murray, & Rushton, 1989; Marsh, 1987, 2007; Marsh & Dunkin, 1997; Marsh & Hocevar, 1991a; Murray, 1990, 1997; Renaud & Murray, 1996) reviewed studies relating overall and specific dimensions of SETs to teacher age, teaching experience, and academic rank.
Feldman reported that SETs were only weakly related to these three measures of seniority. Overall evaluations tended to be negatively correlated with age and—to a lesser extent—years of teaching experience but positively correlated with academic rank. Thus, younger teachers, teachers with less teaching experience, and teachers with higher academic ranks tended to receive somewhat higher evaluations. Consistent with the reviews by Ryans (1960) and Barnes (1985), Feldman noted that in the few studies that specifically examined nonlinear relations, there was some suggestion of an inverted U-shaped quadratic relation in which ratings improved initially, peaked at some early point, and then declined slowly thereafter. More generally, Renaud and Murray (1996) suggested that previous research typically ignored potential nonlinear effects, but they reported that there was little systematic support for the existence of such effects.

There is a large, well-documented body of literature indicating that use of SETs as part of a systematic intervention does lead to improved teaching effectiveness. In his classic meta-analysis, Cohen (1980) found that instructors who received midterm feedback were subsequently rated about one third of a standard deviation higher than controls on the total rating (an overall rating item or the average of multiple items). Studies that augmented feedback with consultation produced substantially larger differences, but other methodological variations had little effect (also see L’Hommedieu, Menges, & Brinko, 1990). In a successful intervention strategy based on the Students’ Evaluation of Educational Quality (SEEQ; Marsh & Roche, 1993), teachers discussed with consultants their relative strengths and weaknesses based on SEEQ feedback, targeted specific areas in which to improve their teaching, selected strategies for improving teaching in the targeted areas from well-documented strategies provided by the consultant, discussed with the consultant how to implement the selected strategies, and monitored improvement in subsequent SEEQ evaluations. Relative to randomly assigned control teachers, feedback teachers improved their overall teaching effectiveness by about .5 SD, and even larger differences were found for targeted areas that were the focus of their individually structured intervention. The most robust finding from the SET feedback research (Marsh, 1987, 2007) is that consultation augments the effects of written summaries of SETs, but simply returning SETs to teachers without consultation has limited effect. Consistent with research showing that teaching effectiveness does not improve with experience, Marsh (2007) argued that these results suggest that teachers do not know how to improve their teaching effectiveness without a systematic intervention incorporating SET feedback and external consultation.

A major limitation in research relating teaching experience and teaching effectiveness is that most studies have considered ratings collected in one specific course on a single occasion. There is surprisingly little research on the stability of mean ratings received by the same teacher over an extended period of time (see related discussion by Feldman, 1983; Horner et al., 1989; Marsh & Hocevar, 1991a). Cross-sectional studies like those reviewed by Feldman (1983) provide a poor basis for inferring what ratings younger, less experienced teachers will receive later in their careers or what ratings older, more experienced teachers would have received earlier in their careers.
However, the direction of a likely selection bias (more effective teachers are more likely to be retained) is opposite to the observed results. Thus, cross-sectional results are likely to be conservative in relation to this bias. Furthermore, Murray (1990; also see Horner et al., 1989; Renaud & Murray, 1996) reported similar results for cross-sectional and longitudinal designs based on SETs collected over a 20-year period. Nevertheless, there are important limitations in the use of cross-sectional data for evaluating how ratings of the same instructor vary over time. In the present investigation, I address this limitation in existing research by applying a multilevel growth modeling approach to simultaneously evaluate mean and covariance stability in the SETs of these teachers over an extended period of time representing a substantial portion of their academic careers.

The Present Investigation

The Research Context

Data are based on an archive of SETs based on the SEEQ instrument (Marsh, 1982, 1984, 1987, 2007; Marsh & Bailey, 1993; Marsh & Hocevar, 1991a, 1991b). This archive contains class-average ratings for almost 50,000 classes collected over a 13-year period at one large, private, research-oriented university in the United States. For purposes of the present investigation, all teachers who were evaluated at least once during each of 10 different years during the 13-year period under consideration are included. This process identified 195 different teachers who had been evaluated in a total of 6,024 different courses, for an average of 30.9 classes per teacher.

As the results of the present investigation are based upon a single university, it is relevant to describe the context. Typically, SEEQ instruments were distributed to faculty shortly before the end of each academic term, administered by a student in the class or by administrative staff according to standardized written instructions, and taken to a central office where they were processed. Although an academic unit’s participation in this program was voluntary, the university required that all units systematically collect some form of SETs and did not consider any personnel recommendations (e.g., tenure, promotion, merit) that did not include SETs. Thus, most academic units that used SEEQ required all teachers to be evaluated in all courses. Although the SETs at this university have a long history of being broadly accepted, readily available, and widely used, there was no systematic program of teacher development or intervention based on the SETs other than feedback based on SEEQ (for a summary of the nature of this feedback, see Marsh, 1987; also see Marsh & Roche, 1994). Although they come from a single university, the extensive set of published results based on data from this university (e.g., Marsh, 1980, 1982, 1984, 1987; Marsh & Roche, 2000) is broadly consistent with findings from other SET research (Marsh, 2001, 2007; Marsh & Dunkin, 1997).

It is also relevant to ask whether ratings across all teachers at this university changed over the period, to provide a frame of reference against which to evaluate changes in the cohort of teachers in the longitudinal sample considered here. Particularly pertinent is the Marsh and Roche (2000) study showing that there was little systematic change in the overall teacher rating on SEEQ across all undergraduate social science classes taught during this period (the combination of linear, quadratic, and cubic effects of the year in which the data were collected explained less than 0.5% of the variance). Also, perceptions of workload–difficulty and expected grades were reasonably stable over this period. Marsh and Roche further reported that even the correlations between the overall teacher evaluation and expected grades (mean r = .20), and between the overall teacher evaluation and perceived workload–difficulty (mean r = .19, higher workload–difficulty associated with higher ratings), were stable over this period. Although this prior research did not evaluate the growth trajectories of individual teachers, it did demonstrate that SETs and their relations with key covariates were reasonably stable over this time—facilitating interpretations of results from the present investigation.

Research Questions

1. Do ratings of the same teachers collected over a 13-year period systematically increase, decrease, or remain stable over time (mean level stability)?

2. Are there large, systematic individual differences in the mean level stability of the SETs, such that some teachers systematically improve whereas others systematically get worse?

3. How highly correlated are ratings of the same teacher from 1 year to the next (covariance stability)?

4. SETs are typically better for graduate-level courses than undergraduate-level courses. Is the mean level stability of SETs systematically different in graduate- and undergraduate-level courses? Are some teachers relatively better at teaching undergraduate than graduate classes?

5. Is there any evidence that mean stability and the nature of the growth function are different for early career teachers? In the SET literature (see earlier discussion), there is some evidence that early career teachers might show a quadratic growth function (an initial increase followed by a subsequent decline), whereas older, more experienced teachers show a small, largely linear decline over time. To test this possibility, linear and quadratic growth functions—and their interaction with early career status—are considered.

Method

Sample

The sample considered here consisted of all teachers who were evaluated at least once during each of 10 different years in a 13-year period. The resulting group included 195 different teachers who had been evaluated on an average of 30.9 classes per teacher; hence, on average, each teacher was evaluated several times during each of the 13 years considered here. This diverse cohort of teachers came from 31 different academic departments, including a broad cross-section of classes from the social sciences, humanities, sciences, and professional schools.

The sample includes a combination of undergraduate- and graduate-level courses. Across all teachers there were 3,699 (61%) undergraduate classes and 2,325 (39%) graduate classes. However, the proportions of graduate-level courses varied considerably (M = .39, SD = .27); a few teachers taught only graduate-level courses (n = 2), and some taught only undergraduate-level courses (n = 12). It is well established in the SET literature that graduate-level classes receive systematically higher ratings than undergraduate-level classes (e.g., Marsh, 1987). Hence, course level is a potentially confounding factor in evaluating stability. Alternative approaches to this issue in SET research have included considering only undergraduate classes or conducting separate analyses of graduate- and undergraduate-level courses. Here I incorporate course level as one of the independent variables in the analysis and also conduct separate analyses of undergraduate- and graduate-level courses as a supplement to the overall analyses.

Statistical Analysis

Traditionally, studies of stability have focused on either mean stability (whether ratings averaged over teachers go up or down over time) or covariance stability (how highly correlated ratings by the same teacher are over time, as in test–retest correlations). Whereas mean level stability can be evaluated cross-sectionally, longitudinal data facilitate interpretations in that changes over time are not confounded with differences between individuals measured at different points in time. Covariance stability can only be evaluated with longitudinal data. These two approaches to stability have typically been considered separately, using analysis of variance-type analyses to evaluate mean stability and correlational techniques to evaluate covariance stability. However, the popularization of growth modeling techniques (Duncan et al., 2006; Goldstein, 2003; Raudenbush & Bryk, 2002; Snijders & Bosker, 1999; Wen, Marsh, & Hau, 2002) and multilevel statistical packages facilitates the incorporation of both approaches to stability into a single, more appropriate analytical framework.

Statistical analyses in the present investigation were based on the commercially available MLwiN statistical package (Rasbash, Steele, Browne, & Prosser, 2005; also see Goldstein, 2003). For purposes of these analyses, a three-level model was considered (Level 3 = teacher; Level 2 = year in which the SETs were collected over the 13-year period; Level 1 = class; the same teacher was typically evaluated in more than one class per year). Predictor (fixed) variables consisted of year (the 13 years, coded −6 . . . 0 . . . +6) and course level (0 = undergraduate, 1 = graduate). Because information on years of teaching experience was not available, an early career status variable (0 = not early career, 1 = early career) was constructed based on academic rank. Academics who were assistant professors during their first 3 of the 13 years considered in this study were designated as “early career.” Linear and quadratic components of time were determined using orthogonal contrasts to appropriately test the interaction between the nature of the growth function and early career status. (See the earlier discussion of Research Question 5; cubic growth functions are not included because preliminary analyses showed them to be nonsignificant, and there were no a priori hypotheses or research questions that focused on the cubic component.) To facilitate interpretations, all independent and dependent variables—including dummy variables—were standardized (M = 0, SD = 1) over the entire sample of 6,024 classes (for further discussion of the advantages of this standardization strategy, see Marsh & Rowe, 1996; also see Aiken & West, 1991).
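
Although the analyses reported here were run in MLwiN, the same model structure can be sketched in open-source software. The following is a minimal, hypothetical sketch of roughly Model 4 (fixed and random linear and quadratic time components) using Python’s statsmodels; the file name and column names (rating, teacher, year) are assumptions, and MLwiN’s class-size weighting is not reproduced.

    # Sketch of the three-level growth model: classes (Level 1) within
    # years (Level 2) within teachers (Level 3), one row per class.
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("seeq_classes.csv")   # hypothetical file: rating, teacher, year (1-13)

    # Zero-center time and form orthogonal linear/quadratic contrasts:
    t = df["year"] - 7                     # years 1-13 become -6 ... 0 ... +6
    df["year_lin"] = t
    df["year_quad"] = t**2 - 14            # mean of t^2 over -6..6 is 182/13 = 14

    # Standardize the outcome (the article standardized all variables,
    # including dummy variables, across the 6,024 classes):
    df["rating"] = (df["rating"] - df["rating"].mean()) / df["rating"].std()

    # Teacher-level random intercept and growth terms, plus a
    # year-within-teacher variance component; classes form the residual.
    model = sm.MixedLM.from_formula(
        "rating ~ year_lin + year_quad",
        groups="teacher",                       # Level-3 clusters
        re_formula="1 + year_lin + year_quad",  # random intercept and slopes
        vc_formula={"yr": "0 + C(year)"},       # Level-2 variance component
        data=df,
    )
    print(model.fit(reml=True).summary())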

The dependent variable in all analyses was the class-average rating—the overall teacher rating averaged across all responding students in each of the 6,024 classes. A potentially complicating factor is that the number of students responding in each of these classes varied substantially, ranging from 1 to 618 (M = 24.39, Mdn = 17, SD = 30.1); 24% of the class-average responses were based on fewer than 10 students. The reliability of class-average evaluations varies with the number of students per class. Based on results from the SEEQ archive, Marsh (1987) estimated the reliability of the class-average response to be .95 for 50 students, .90 for 25 students, .74 for 10 students, and .60 for 5 students. Hence, the number of students responding within each class is a confounding factor in the evaluation of stability and in SET research more generally. Alternative approaches to this issue have included ignoring it, considering only courses with a minimum sample size (e.g., 10), recommending that SETs based on a small number of students be interpreted cautiously, or taking the average of responses across different courses so that the issue is eliminated. In the present investigation I addressed this issue in two ways. First, in an apparently novel approach to this problem, I used the weighting procedure in MLwiN so that each class was weighted by the number of students responding. Second, I supplemented these overall analyses with separate analyses based on the classes in which the number of students responding was at least 10.
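
These archive-based reliability estimates behave much as the Spearman–Brown formula predicts for an average of n ratings. The short sketch below reproduces values close to (though not identical to) the figures above under an assumed single-rater reliability of about .27; that value is an illustrative assumption, not a figure reported in the article.

    # Spearman-Brown projection of the reliability of an n-student
    # class average from an assumed single-rater reliability r1.
    def class_average_reliability(n: int, r1: float) -> float:
        return n * r1 / (1 + (n - 1) * r1)

    for n in (5, 10, 25, 50):
        print(n, round(class_average_reliability(n, r1=0.27), 2))
    # -> roughly .65, .79, .90, .95 (cf. .60, .74, .90, .95 in the article)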

Results

Preliminary Single-Level Analyses

For purposes of illustration, I begin with descriptive statistics to depict general trends in the data that are analyzed more fully in subsequent multilevel growth models. A box plot of ratings over the 13 years (see Figure 1) shows a high degree of mean stability over time in that there are neither large systematic increases nor decreases over time. Within each year there is a consistent difference between undergraduate- and graduate-level courses in that graduate-level courses receive higher ratings. The same pattern is shown in the scatter plot (see Figure 2), in which there is almost no linear or nonlinear relation between time and ratings. Both approaches indicate that there is considerable variation in the ratings for different teachers within each of the 13 years. However, it is important to emphasize that these traditional approaches to evaluating mean stability do not take into account the longitudinal nature of the data, in which the same teachers have ratings in different years. This feature of the data invalidates traditional single-level tests of statistical significance based on analysis of variance, multiple regression, and structural equation models. This is a critical limitation.

Figure 1. Box-plot graphs of ratings of 6,024 undergraduate- and graduate-level university courses taught by a group of 195 teachers who were evaluated continuously over a 13-year period. Boxes represent the median response (the dark horizontal line in the middle of the box) and the 25th and 75th percentiles. The error bars reflect the 10th and 90th percentiles.

It is not easy to depict the covariance stability of the ratings in this study because of the nature of the data (graduate- and undergraduate-level courses and varying numbers of classes per year). To illustrate this covariance stability, I aggregated all the ratings within a given year for the same teacher, separately for undergraduate- and graduate-level courses (a code sketch of this aggregation appears at the end of this section). Based on undergraduate classes (Table 1), there are substantial correlations between ratings received by the same teacher from one year to the next (mean r = .57). Although the ratings show a small simplex pattern in which correlations among ratings are somewhat larger in adjacent years (i.e., correlations adjacent to the diagonal of the matrix; Table 1) and smaller for ratings that are separated by more time, this effect is not substantial. Thus, for example, ratings in Year 1 correlate .59 with ratings in Year 2 and .60 with ratings in Year 13. The covariance stability was systematically lower for graduate-level courses (mean r = .40). There is also a substantial biasing effect due to the small numbers of students evaluating each class, and teachers are typically recommended to interpret the results cautiously when there are fewer than 10 students per class. For undergraduate classes with at least 10 students per class, the average correlation among the different years is .61 (compared to .57 in Table 1).

In summary, preliminary results suggest that the SETs have a high degree of stability in terms of both mean stability (mean ratings over time) and covariance stability (test–retest correlations across the 13 years). There are, of course, limitations in these analyses that restrict their usefulness. In particular, it was not possible to easily integrate into the single-level analyses of the means information about the stability of individual teachers—the extent to which the mean stability observed across all teachers generalized to each teacher considered separately, and whether trends for individual teachers varied systematically in relation to characteristics of the teachers or the courses that they taught. I now pursue these issues in more appropriate multilevel growth models of these data.
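
The aggregation behind Table 1 can be expressed compactly. The following sketch continues the hypothetical data frame from the earlier model sketch and additionally assumes a 0/1 grad indicator column; it forms one aggregate rating per teacher per year for undergraduate classes and then averages the off-diagonal inter-year correlations.

    # Aggregate each teacher's undergraduate ratings within year, then
    # correlate the 13 yearly aggregates across teachers (cf. Table 1).
    import numpy as np

    ug = df[df["grad"] == 0]                      # undergraduate classes only
    agg = (ug.groupby(["teacher", "year"])["rating"]
             .mean()                              # one aggregate per teacher-year
             .unstack("year"))                    # 195 teachers x 13 years
    r = agg.corr()                                # 13 x 13 inter-year correlations
    off_diag = r.values[~np.eye(len(r), dtype=bool)]
    print(round(off_diag.mean(), 2))              # about .57 in the article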

Figure 2. Scatter diagram representing the ratings of 6,024 undergraduate- and graduate-level university courses taught by a group of 195 teachers who were evaluated continuously over a 13-year period. The overall teacher rating was regressed on year, and separate graphs (including linear and quadratic polynomials) were constructed for undergraduate- and graduate-level courses. Rsq = multiple R2.

Multilevel Analyses

Stability of ratings over time. Results from a set of multilevel growth models with three levels (Table 2) are based on various combinations of (fixed) predictor variables. For each of these models there are three levels: Level 3 = teacher; Level 2 = time; Level 1 = class (reflecting the fact that most teachers had more than one set of evaluations in many years). The models varied in terms of the (fixed) predictor variables that were considered: linear and quadratic components of time (years), course level (graduate vs. undergraduate), and a dichotomous variable representing early career status.

In the first (variance components) model, there are no predictor variables. This model provides a baseline against which to compare other models but also provides variance components for each of the three levels. Inspection of Model 1 (Table 2) demonstrates that there is a substantial variance component associated with teacher (.41). Because the variables are standardized, approximately 40% of the total variance [.41/(.41 + .16 + .45)] can be explained by the teacher (see the worked computation below), indicating that the same teacher consistently gets systematically higher or lower ratings than other teachers (covariance stability). The variance component associated with time (.16) is much smaller, indicating relatively smaller amounts of variability within teachers over time. The final variance component (.45) is residual, unexplained variance that includes measurement error, differences associated with the different classes taught in the same year, and all other sources of variation not included in the model.

In Model 2 the linear effect of time is added as a predictor (fixed effect) variable. Although the effect is statistically significant (i.e., the parameter estimate is more than twice its standard error, and the change in deviance is large relative to the one additional parameter in the model), the effect is very small (−.05). Thus, the results suggest that there is a very small decline in mean ratings over time. The variance components remain almost unchanged, although the random variance component due to time does decrease slightly.

In Model 3 I allowed the linear effect of time to be random at the teacher level. This residual variance component is highly significant, indicating that there are systematic differences between the linear trends associated with different teachers—some go up with time and some go down with time. However, this teacher-to-teacher variation in the linear effect of time is very small (.03; the residual variance component of time when it is made random at the teacher level in Model 3, Table 2), indicating that differences among teachers in this linear trend are not large. Also of interest in Model 3 is the residual covariance term reflecting the relation between the intercept for each teacher (in this case, roughly the average rating for teachers at Year 7, the middle of the 13 years, because year is zero-centered) and the slope (the linear change in ratings over years). Despite the very large sample size, this residual covariance term is not statistically significant.
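
As a worked version of the variance partition just described, using the Model 1 estimates from Table 2:

    # Share of total variance at each level in the variance-components model.
    teacher, year, resid = 0.41, 0.16, 0.45   # Model 1 estimates (Table 2)
    total = teacher + year + resid            # 1.02
    for label, vc in [("teacher", teacher), ("year", year), ("class", resid)]:
        print(f"{label}: {vc / total:.2f}")
    # -> teacher: 0.40, year: 0.16, class: 0.44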

Table 1
Covariance Stability: Correlations Between Ratings of the Same Teacher Across 13 Years

Year    1    2    3    4    5    6    7    8    9   10   11   12   13
 1      —  .59  .46  .56  .57  .56  .55  .53  .53  .56  .49  .50  .60
 2    .59    —  .53  .53  .58  .49  .53  .56  .58  .49  .56  .54  .56
 3    .46  .53    —  .54  .61  .46  .64  .49  .58  .40  .52  .52  .53
 4    .56  .53  .54    —  .52  .54  .61  .61  .59  .57  .46  .52  .41
 5    .57  .58  .61  .52    —  .67  .70  .73  .64  .58  .62  .60  .59
 6    .56  .49  .46  .54  .67    —  .67  .66  .61  .49  .50  .56  .62
 7    .55  .53  .64  .61  .70  .67    —  .71  .66  .62  .59  .62  .61
 8    .53  .56  .49  .61  .73  .66  .71    —  .68  .69  .64  .64  .52
 9    .53  .58  .58  .59  .64  .61  .66  .68    —  .64  .52  .65  .50
10    .56  .49  .40  .57  .58  .49  .62  .69  .64    —  .54  .57  .48
11    .49  .56  .52  .46  .62  .50  .59  .64  .52  .54    —  .68  .59
12    .50  .54  .52  .52  .60  .56  .62  .64  .65  .57  .68    —  .71
13    .60  .56  .53  .41  .59  .62  .61  .52  .50  .48  .59  .71    —

Note. N = 3,699 undergraduate classes taught by 195 teachers who were evaluated continuously over a 13-year period. For each of the 195 teachers, all ratings for a given year were aggregated across undergraduate courses to form a single aggregate rating for each teacher in each year. The average correlation is .57.


Table 2
Multilevel Growth Models of Stability of SETs of the Same Teachers Over 13 Years

Variable                 Model 1         Model 2         Model 3         Model 4         Model 5         Model 6         Model 7

Fixed effects
  T-L                                    −0.05 (.017)    −0.05 (.017)    −0.05 (.013)    −0.05 (.018)*   −0.05 (.017)    −0.05 (.018)*
  T-Q                                                                    −0.01 (.013)    −0.02 (.015)    −0.01 (.015)    −0.02 (.015)
  Course                                                                                  0.16 (.024)*    0.17 (.024)*    0.17 (.024)*
  T-L × Course                                                                                            0.01 (.017)     0.01 (.017)
  T-Q × Course                                                                                            0.01 (.014)     0.01 (.015)
  E                                                                                                                       0.03 (.048)
  E × T-L                                                                                                                 0.01 (.019)
  E × T-Q                                                                                                                −0.01 (.017)
  E × Course                                                                                                             −0.04 (.019)*
  E × T-L × Course                                                                                                        0.01 (.020)
  E × T-Q × Course                                                                                                       −0.01 (.018)

Residual variance components
  Level 3: Teacher        0.41 (.043)*    0.41 (.043)*    0.41 (.042)*    0.41 (.043)     0.39 (.044)*    0.39 (.038)*    0.38 (.038)*
    Time(L)                                               0.03 (.006)*    0.03 (.006)*    0.02 (.006)*    0.02 (.006)*    0.03 (.006)*
    Time(Q)                                                               0.02 (.006)*    0.02 (.005)*    0.01 (.004)*    0.01 (.004)*
    Course                                                                                0.07 (.014)*    0.07 (.014)*    0.07 (.013)*
    T(L) × Course                                                                                         0.02 (.005)*    0.02 (.006)*
    T(Q) × Course                                                                                         0.01 (.005)*    0.01 (.004)*
  Level 2: Year           0.16 (.015)*    0.15 (.015)*    0.12 (.014)*    0.11 (.013)*    0.10 (.012)*    0.09 (.012)*    0.09 (.012)*
  Level 1: Residual       0.45 (.022)*    0.45 (.022)*    0.45 (.022)*    0.45 (.022)*    0.40 (.019)*    0.38 (.019)*    0.38 (.012)*

Residual covariance components
  Teacher with T-L                                        0.02 (.019)     0.02 (.012)     0.01 (.011)     0.01 (.013)     0.01 (.013)
  Teacher with T-Q                                                        0.00 (.016)     0.00 (.012)    −0.01 (.011)    −0.01 (.011)
  Teacher with Course                                                                    −0.04 (.018)*   −0.04 (.016)*   −0.04 (.017)*
  T-L with T(L) × Course                                                                                  0.02 (.017)*    0.02 (.017)*

Deviance                 14,978.8        14,962.8        14,910.0        14,885.9        14,417.4        14,363.1        14,359.6

Note. N = 6,024 classes. All outcome and predictor variables were standardized (M = 0, SD = 1) at the individual student level for the entire sample of 6,024 classes. The dependent variable in all analyses is the overall teacher rating in each of the 6,024 classes. In each analysis, a three-level model was evaluated: Level 3 = teacher; Level 2 = time; Level 1 = class. (Teachers were typically evaluated in more than one class each time, so there are multiple classes for each teacher-time combination.) Model 2 adds the linear component of time; Model 4 adds the quadratic component; Model 5 adds course level; Model 6 adds Course × Time interactions; Model 7 adds early career status. In different models, linear and quadratic components of time (the 13 years), course level (1 = undergraduate, 2 = graduate), and early career status (1 = early career at Year 1 of the study, 2 = not early career) are considered. All parameter estimates are statistically significant when they differ from zero by more than two standard errors (in parentheses). SETs = students’ evaluations of teaching effectiveness; T-L = time-linear; T-Q = time-quadratic; E = early career. All remaining residual covariance terms not listed were nonsignificant and excluded from presentation.
* p < .05.

This means that growth in ratings (in this case, an almost complete lack of growth) is similar for teachers whose teaching effectiveness is high, average, or low. Although substantively important, I do not discuss this result in relation to subsequent models because this term is nonsignificant in all models considered here. It is also interesting that allowing the linear trend to be random reduced the random effect associated with time from .15 to .12. Hence, variation in the slope of the linear growth component accounts for some of the within-teacher variation in years.

In Model 4 I added the quadratic effect of time as a predictor variable and allowed it to be random at the teacher level. Whereas the fixed quadratic effect of time was not significant, there was a very small random component that was statistically significant—indicating that there is some teacher-to-teacher variation in this component of time. In Figure 3 I show the plots based on individual teachers from this (random intercept-slope) Model 4, which demonstrate how little variation there is in the slope of the time function for individual teachers.

Figure 3. Each of the 195 grey horizontal lines represents the linear and quadratic effects of year for a different teacher, whereas the black line represents the average function across the 195 teachers. Although there is significant variation in the linear and quadratic components for each teacher, the extent of this variation is small (see Table 2, Model 4). There is, however, substantial variation in the intercepts associated with each teacher, indicating substantial variation between teachers in overall teaching effectiveness that is consistent over time.

The effect of course level. In Model 5 I added course level (graduate vs. undergraduate) as a predictor variable and allowed it to be random at the teacher level. The positive effect of course level reflects the fact that SETs are systematically higher in graduate-level courses than in undergraduate-level courses, and this effect is highly significant. Furthermore, the residual variance component for course level at the teacher level is statistically significant and moderately large. This indicates that there are systematic differences between teachers in the extent to which ratings are higher in graduate-level courses—suggesting that some teachers are relatively more effective in graduate-level courses, whereas others are relatively more effective in undergraduate-level courses. Some of this difference might reflect a ceiling effect, particularly in graduate-level classes in which ratings are higher than in undergraduate-level classes. However, some teachers get higher ratings in undergraduate classes than in graduate-level classes.

In Model 6 I added interactions between course level and the polynomial components of time and allowed these to be random at the teacher level. The fixed effects of the interaction between course level and time were nonsignificant for both the linear and quadratic growth components—the higher mean rating for graduate courses is consistent over time. However, the residual variance components associated with the teacher were statistically significant for these interaction effects. Thus, for example, the difference between ratings of undergraduate- and graduate-level courses increases systematically over time for some teachers, whereas it declines for others. Although statistically significant, these residual variance components are very small. In this model there is also a significant negative covariation between the teacher intercept and the residual course effect, indicating that the teacher effect is larger for undergraduate classes and smaller for graduate-level courses.

Figure 4 illustrates the relative sizes of the main effects from this model with caterpillar plots (which show the mean and 95% confidence interval for each teacher in relation to variables included in the model). These clearly show that teachers vary substantially in terms of teaching effectiveness, less in terms of differences in their effectiveness in graduate and undergraduate classes, and very little in terms of linear and quadratic growth components. Indeed, focusing on the linear component of time, there are only a few teachers at each end of the graph (teachers are ranked from lowest to highest) for whom the 95% confidence interval does not include the mean growth across all teachers—only a few teachers have systematically more positive growth or systematically more negative growth than the sample as a whole. In contrast, there are somewhat larger differences associated with course level; some teachers have relatively higher ratings in undergraduate classes than in graduate-level classes (those on the left-hand side of the caterpillar plot), whereas others have relatively higher ratings in graduate-level courses than in undergraduate-level courses (those on the right-hand side of the caterpillar plot). Even here, however, the majority of teachers (those with error bars that include the total sample mean) do not differ significantly from results based on the total sample (i.e., somewhat higher ratings in graduate-level classes).

Figure 4. One hundred ninety-five teachers ranked in terms of how they differ in teaching effectiveness (teacher intercept), in which each of the 195 vertical lines represents the teacher intercept (in terms of teaching effectiveness) with an error bar (±1.96 standard errors). Other graphs depict the teacher-to-teacher variation in the polynomial effects of year (linear and quadratic components of the 13 years considered in this study) and course level (undergraduate vs. graduate). Residual components are significant for all four effects (see Table 2, Model 6) but are much larger for teacher intercepts than for course level and particularly the polynomial components of year. Teacher-inter = teacher intercept; Year-Quadr = year-quadratic.

Early career status. In the teacher effectiveness literature, there is some suggestion that teaching effectiveness improves early in a teacher’s career, followed by a gradual decline in subsequent years—a quadratic growth function. This suggests that early career teachers should differ from other teachers in terms of linear and particularly quadratic growth components. To test this suggestion, in Model 7 I added the early career (dichotomous) variable and its interactions with other predictor variables to test whether the linear and particularly the quadratic growth functions vary as a function of early career status. However, the very small change in the deviance statistic associated with the addition of all these variables suggests that the effects are small (see Models 6 and 7 in Table 2). Consistent with this observation, the effect of early career status is nonsignificant, as are its interactions with the linear and quadratic growth components. Although there is a very small (but statistically significant, .01 < p < .05) interaction between early career status and course level (early career teachers get slightly lower ratings in graduate-level courses), not even this effect varies with time. Hence, the substantial stability of ratings of teaching effectiveness is very consistent across early career teachers and their more senior colleagues.

Note on residual covariance terms. Because the many residual covariance terms in the random part of the various models were all small and mostly nonsignificant, I have not focused on them. However, given the large sample size, this lack of covariation is a substantively important result. Particularly interesting is the lack of covariation between the teacher intercept (approximately the mean rating at Year 7, the middle year, as all variables were standardized) and the components of time. These covariances were nonsignificant in all models, indicating that the lack of growth in teaching effectiveness did not vary as a function of the overall effectiveness of the teacher. More effective teachers are no more likely than less effective teachers to show positive or negative growth. No residual covariance terms were significant at p < .01, but one was significant at p < .05—the negative residual covariance between the teacher intercepts and course level. More effective teachers tended to do relatively better in undergraduate-level classes, so that the difference between ratings in undergraduate- and graduate-level classes tended to be smaller for more effective teachers. However, even this residual covariance term is very small. (Also see the Appendix, where the full multilevel regression equation, including the full matrix of residual variance-covariance terms, is presented for Model 7.)

Multilevel Analyses of Selected Subsamples

To supplement the results and establish comparability with other research, I applied two of these models to selected subsamples of the total sample (Table 3). In particular, because the main focus of much SET research is on undergraduate courses and courses with at least 10 students, I evaluated the stability of teaching effectiveness in subsamples composed of various combinations of these variables.

In Subsample A (see Models 1A and 7A in Table 3, corresponding to Models 1 and 7 in Table 2), I consider classes with at least 10 students responding (n = 4,596 classes). In the variance components model (1A in Table 3), the variance component for the teacher is substantially larger than for the total sample (.46 vs. .41). There is a corresponding decrease in the residual variance components at Levels 2 (year) and 1 (residual). These results, of course, reflect—at least in part—the fact that class-average ratings based on at least 10 students are systematically more reliable than class averages based on fewer than 10 students (and suggest that this difference was not fully accounted for by weighting classes by the number of students responding in each class). In Model 7A, based on this same subsample, the fixed effects of linear time and course level remain statistically significant, although the effect of course level is somewhat smaller than that for the total sample. For this subsample, the fixed effect of early career status is not statistically significant, nor does it interact significantly with the effects of time or course level. In summary, except for the larger variance component for teachers in this subsample, the results are similar to those in the total sample.

In the next two sets of models I consider all undergraduate classes (Subsample B; n = 3,699 classes; see Models 1B and 7B in Table 3) and undergraduate classes with at least 10 students (Subsample C; n = 3,031 classes; see Models 1C and 7C). Because these models are based on only undergraduate classes, course level is excluded from the analyses. In each of these models, the size of the teacher effect is substantially larger than it is for the total sample: .51 and .54, respectively, compared to .41 in the total sample. This increase in the teacher effect is accompanied by a corresponding decrease in the residual variance terms. Hence, the ratings show much more covariance stability for undergraduate classes than for the total sample. In other respects, the results based on undergraduate classes are similar to the corresponding models based on the total sample—particularly the small negative linear growth component and the nonsignificant quadratic growth component, early career status, and interactions between these effects.

In the final two sets of models I consider all graduate classes (Subsample D; n = 2,325 classes; see Models 1D and 7D in Table 3) and graduate classes with at least 10 students (Subsample E; n = 1,450 classes; see Models 1E and 7E). For these graduate classes, the teacher component is substantially lower (.40 and .47, respectively) than it is in the corresponding models with undergraduate classes (.51 and .54). It is interesting that for these graduate classes, the fixed effects (linear and quadratic growth components, early career status, and interactions) were statistically nonsignificant—even the linear growth component that showed a very small, statistically significant decline in all other models. However, this result should not be overinterpreted, as the linear effect of time did not interact significantly with course level in earlier results (Model 6 in Table 2).

In summary, results of these supplemental analyses demonstrate that the mean stability of the ratings generalizes very well across these various subsamples. In contrast, there were substantial and predictable differences between the various subsamples in terms of covariance stability. Of particular interest is the sample of undergraduate classes in which classes with few students responding were excluded—the basis of much SET research. For this subsample of 3,031 undergraduate classes collected over 13 years, a remarkable 54% of the variance in the ratings could be explained in terms of the teacher. Not only were the ratings stable over time, but effective teachers consistently received positive ratings, whereas less effective teachers consistently received lower ratings.
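
Such subsample checks are straightforward to script. The following sketch again uses the hypothetical column names from the earlier sketches (with n_students as an assumed per-class respondent count); each refit reports the teacher-level variance component analogous to Models 1A through 1E.

    # Refit the variance-components model within each Table 3 subsample.
    subsamples = {
        "A: total, n > 9":       df[df["n_students"] > 9],
        "B: all undergraduate":  df[df["grad"] == 0],
        "C: undergrad, n > 9":   df[(df["grad"] == 0) & (df["n_students"] > 9)],
        "D: all graduate":       df[df["grad"] == 1],
        "E: graduate, n > 9":    df[(df["grad"] == 1) & (df["n_students"] > 9)],
    }
    for label, sub in subsamples.items():
        m = sm.MixedLM.from_formula("rating ~ 1", groups="teacher",
                                    vc_formula={"yr": "0 + C(year)"}, data=sub)
        res = m.fit(reml=True)
        print(label, float(res.cov_re.iloc[0, 0]))  # teacher variance component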

Discussion and Implications

Do teachers become more or less effective with added experience? The present investigation combined new, evolving methodology to address this critical question that is relevant to all levels of education. Sadly, there exists a body of research showing that teaching effectiveness tends to decline—not improve—with added experience. There are, however, several important caveats to this generalization. First, the vast majority of studies evaluating this phenomenon are based on cross-sectional data rather than true longitudinal data. Cross-sectional studies cannot evaluate covariance stability at all and are not ideally suited for the evaluation of mean stability because of potential selection effects.

Table 3
Multilevel Growth Models of Stability of SETs of the Same Teachers Over 13 Years: Supplemental Analyses

A: Total n > 9 (4,596 classes)             Model 1A        Model 7A
  Fixed effects
    T-L                                                    −0.06 (.020)*
    T-Q                                                    −0.02 (.017)
    Course                                                  0.09 (.028)*
    T-L × Course                                            0.02 (.021)
    T-Q × Course                                            0.02 (.016)
    E                                                       0.03 (.050)
    E × T-L                                                −0.02 (.021)
    E × T-Q                                                 0.01 (.019)
    E × Course                                             −0.04 (.023)
    E × T-L × Course                                        0.00 (.022)
    E × T-Q × Course                                        0.01 (.019)
  Residual variance components
    Level 3: Teacher                        0.46 (.048)*    0.40 (.040)*
      Time(L)                                               0.03 (.009)*
      Time(Q)                                               0.02 (.007)*
      Course                                                0.08 (.018)*
      T(L) × Course                                         0.03 (.009)*
      T(Q) × Course                                         0.01 (.004)
    Level 2: Year                           0.14 (.015)*    0.07 (.011)*
    Level 1: Residual                       0.40 (.022)*    0.35 (.019)*
  Residual covariance: T-L with T(L) × Course               0.02 (.009)
  Deviance                                 10,681.3        10,241.6

B: All undergraduate (3,699 classes)       Model 1B        Model 7B
  Fixed effects
    T-L                                                    −0.07 (.019)*
    T-Q                                                    −0.02 (.018)
    E                                                       0.06 (.054)
    E × T-L                                                −0.01 (.022)
    E × T-Q                                                 0.00 (.019)
  Residual variance components
    Level 3: Teacher                        0.51 (.054)*    0.50 (.053)*
      Time(L)                                               0.02 (.008)*
      Time(Q)                                               0.02 (.007)*
    Level 2: Year                           0.16 (.019)*    0.10 (.016)*
    Level 1: Residual                       0.36 (.025)*    0.36 (.024)*
  Deviance                                  8,670.2         8,595.2

C: Undergraduate n > 9 (3,031 classes)     Model 1C        Model 7C
  Fixed effects
    T-L                                                    −0.08 (.020)*
    T-Q                                                     0.06 (.054)
    E                                                       0.08 (.053)
    E × T-L                                                −0.01 (.022)
    E × T-Q                                                 0.01 (.019)
  Residual variance components
    Level 3: Teacher                        0.54 (.060)*    0.53 (.056)*
      Time(L)                                               0.03 (.008)*
      Time(Q)                                               0.03 (.009)*
    Level 2: Year                           0.15 (.019)*    0.09 (.016)*
    Level 1: Residual                       0.33 (.022)*    0.33 (.022)*
  Deviance                                  8,505.9         5,856

D: All graduate (2,325 classes)            Model 1D        Model 7D
  Fixed effects
    T-L                                                    −0.04 (.030)
    T-Q                                                     0.00 (.024)
    E                                                       0.00 (.054)
    E × T-L                                                −0.02 (.033)
    E × T-Q                                                 0.00 (.003)
  Residual variance components
    Level 3: Teacher                        0.40 (.054)*    0.39 (.049)*
      Time(L)                                               0.07 (.018)*
      Time(Q)                                               0.02 (.044)
    Level 2: Year                           0.17 (.027)*    0.10 (.023)*
    Level 1: Residual                       0.45 (.034)*    0.45 (.034)*
  Deviance                                  6,747.4         6,637.9

E: Graduate n > 9 (1,450 classes)          Model 1E        Model 7E
  Fixed effects
    T-L                                                    −0.04 (.038)
    T-Q                                                     0.00 (.028)
    E                                                      −0.01 (.061)
    E × T-L                                                −0.03 (.043)
    E × T-Q                                                 0.01 (.034)
  Residual variance components
    Level 3: Teacher                        0.47 (.071)*    0.44 (.063)*
      Time(L)                                               0.08 (.024)*
      Time(Q)                                               0.01 (.014)
    Level 2: Year                           0.14 (.027)*    0.08 (.022)*
    Level 1: Residual                       0.39 (.022)*    0.38 (.034)*
  Deviance                                  3,440.1         3,398.8

Note. All outcome and predictor variables were standardized (M = 0, SD = 1) at the individual student level for the entire sample of 6,024 classes. The dependent variable in all analyses is the overall teacher rating in each of the classes. In each analysis, a three-level model was evaluated: Level 3 = teacher; Level 2 = time; Level 1 = class. In different models, linear and quadratic components of time (the 13 years), course level (1 = undergraduate, 2 = graduate), and early career status (1 = early career at Year 1 of the study, 2 = not early career) are considered. All parameter estimates are statistically significant when they differ from zero by more than two standard errors (in parentheses). For analyses presented here, various subsets of the cases were considered for selected models (also see Table 2). SETs = students’ evaluations of teaching effectiveness; T-L = time-linear; T-Q = time-quadratic; E = early career. All remaining residual covariance terms not listed were nonsignificant and excluded from presentation.
* p < .05.


Second, there is some evidence from cross-sectional research suggesting that teaching effectiveness does improve over the first few years of teaching, followed by a gradual decline in teaching effectiveness. Third, most research is based on studies without any systematic interventions designed to improve teaching effectiveness. However, particularly at the university level, there is ample support for the effectiveness of systematic interventions designed to enhance teaching effectiveness (e.g., Cohen, 1980; Marsh, 2007; Marsh & Roche, 1993, 1997; Overall & Marsh, 1979). Fourth, studies of mean stability are typically based on results aggregated across many teachers, so that, perhaps, there are large individual differences for particular teachers—some improving and others declining—that are lost when averaged across teachers.

The results of the present investigation address many of these limitations of previous research. Based on a diverse sample of 195 university teachers who were evaluated continuously over a 13-year period, the present investigation is well suited to evaluate both mean and individual stability from an appropriate longitudinal perspective. Importantly, the results provide clear answers in an area in which there has been much confusion. University teaching effectiveness based on SETs is highly stable in terms of both mean and covariance stability. Because this sample contained teachers from all levels of teaching experience, it was possible to show that linear and quadratic growth functions were very small and did not differ significantly for early career and more experienced teachers. Particularly for mean level stability, the stability generalized well over undergraduate- and graduate-level classes, over teachers who were early career and those who were more experienced at the start of the study, and over teachers who differed in overall effectiveness. Whereas some teachers improved with time and others got worse, most showed very little systematic change in teaching effectiveness. Teachers who were relatively poor teachers at the start of the study mostly remained poor teachers, whereas those who were relatively good teachers at the start of the study mostly remained good teachers throughout the 13 years. Across the spectrum of good to bad teaching, teachers did not get systematically more effective with experience, but neither did they become less effective. At least for the diverse sample of teachers in this study, the results provide a clear and unambiguous answer to the question that I originally posed. Teaching effectiveness is remarkably stable, suggesting that teachers do not gain from experience.

It is, however, relevant to speculate on why the results of this study apparently differ from conclusions based on previous research. Let me preface these remarks with the observation that the differences are not substantial. I found a very small negative linear growth component that was significant from a statistical perspective (due to the large sample size) but apparently not from a practical perspective. Although previous research has tended to find a negative relation, in the most comprehensive review of this research at the university level, Feldman (1983) found that almost half the studies he reviewed reported no significant relation between SETs and various measures of experience and age. However, several other observations are relevant.
Whereas there is ample evidence to suggest that SETs need to be supplemented with consultation to improve teaching effectiveness, results of the present investigation suggest that the expected decline in teaching effectiveness over time might be diminished through the institution of a systematic program of SETs. Nevertheless, there now exists a substantial body of research indicating that the collection of SETs coupled with systematic intervention that includes consultation with an external consultant is substantially more effective than SET feedback alone (see Marsh, 1987, 2007).

In addition to the substantive implications of this research, there are also important methodological implications. The multilevel growth modeling approach introduced here has important advantages over the single-level analyses used in previous studies. With this approach, I was able to systematically combine the issues of mean and covariance stability into a single analytic framework that provided further insight into both these aspects of long-term stability. The interpretation of strong mean stability was greatly facilitated by the finding of strong covariance stability. Without a demonstration of strong covariance stability, the mean stability might have meant that measures of teaching effectiveness were random and that ratings of the same teacher were highly inconsistent from one time to another. However, the multilevel growth models showed that the results based on the group average generalized well to individual teachers. Similarly, it would be possible for teachers to be highly stable over time in the sense of covariance stability but not in terms of mean stability (i.e., to show systematic increases or decreases over time). With a multilevel growth modeling approach, I was able to show that there were systematic differences in the teaching effectiveness of individual teachers and that these differences were highly stable over time. The teaching effectiveness of the individual teachers—as well as of the group of teachers as a whole—did not systematically increase or decrease over time.

The results of the present investigation also raised a number of new issues, some of which reflect limitations of the present investigation or directions for further research. Because selection was based on teachers who had taught at this one university for an extended period of time, there is a potential selection bias. However, to the extent that tenure decisions at research-oriented universities are based substantially on research track records, this potential bias probably had little effect on the results of the present investigation. Furthermore, the results based on this true longitudinal sample are largely consistent with results based on cross-sectional research from the same university (e.g., Marsh & Roche, 2000) that does not suffer from this problem (but has many other limitations). This, of course, still leaves open the question of how well results based on a single university generalize to other universities. Although results based on a large number of published SET studies from this university show good generalizability to results from SET research more generally (see reviews by Marsh & Dunkin, 1997; Marsh, 1987, 2007), this does not guarantee that this is the case for these results as well. Unfortunately, there is a dearth of longitudinal studies that track the ratings of the same cohort of teachers over such an extended period of time with which to evaluate the generalizability of these findings.

Most research focuses on undergraduate teaching, whereas I systematically evaluated teaching effectiveness in both undergraduate- and graduate-level courses. I confirmed the well-established finding that SETs are higher in graduate-level courses, but the multilevel growth modeling approach raised a number of new aspects of this finding.
Averaged across all teachers, the difference in teaching effectiveness for undergraduate- and graduate-level courses did not vary with time. However, from a covariance stability perspective, teaching effectiveness in undergraduate classes was more stable over time. More important from an applied policy perspective, there were moderately large individual differences in this course-level effect. Some teachers were relatively more effective at teaching undergraduate courses, whereas others were relatively more effective in graduate-level courses.

A surprising finding in the present investigation, despite the large sample size, was the very small, mostly nonsignificant residual random covariances between overall teaching effectiveness and other characteristics of individual teachers. Thus, for example, growth components were not systematically related to overall teaching effectiveness. Hence the mean stability in teaching effectiveness over time was consistent for teachers who were high, medium, and low in terms of teaching effectiveness. In fact, only the residual covariance between course level and teacher was consistently significant, suggesting that the difference between teaching effectiveness in graduate- and undergraduate-level courses was smaller for more effective teachers.

A methodological issue that was not fully resolved in the present investigation was how to incorporate sample size, the number of students evaluating each class, into the analyses. Because the number of responding students varied from 1 to 681, this was a potentially important issue. In much SET research this issue is implicitly recognized by considering only those classes in which the number of responding students exceeds some minimum (e.g., 10). Here, I considered several alternative approaches. First, all of the 6,024 class-average ratings were weighted by the number of cases making up the class average, so that ratings of large classes contributed more than ratings of small classes. Not surprisingly, weighting the results by the number of students in each class marginally increased support for covariance stability (the variance component for teachers was .41 in Table 3, compared with .37 in the unreported, corresponding analysis without weighting). However, supplemental analyses indicated that this weighting was not sufficient, in that separate analyses of classes in which more than 10 students responded resulted in systematically larger teacher components (.46 for all undergraduate and graduate courses, and .54 for undergraduate-level classes). A potentially important limitation of the present investigation in relation to this issue is that analyses were based on class-average responses; individual student-level data (146,925 surveys, an average of 24.39 students per class) were not available. Individual student-level data would have complicated the analyses, but they would have provided a systematic evaluation of, and correction for, unreliability due to lack of agreement among students within the same class. Whereas this is conceptually feasible in multilevel growth modeling, there are few such applications and, apparently, none in the SET literature.
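To make the single-framework idea concrete, the following sketch shows one way such a growth model might be specified in open-source software: fixed linear and quadratic time trends carry the mean-stability question, teacher-level random effects carry the covariance-stability question, and refitting after excluding small classes probes the class-size issue just discussed. This is a minimal sketch using Python's statsmodels, not the MLwiN analysis reported here; the data file and column names (teacher, year, time, rating, n_students) are hypothetical, and the specification approximates the three-level structure with a year-within-teacher variance component.

```python
# Minimal sketch (not the original MLwiN analysis): a growth model that
# addresses mean stability (fixed time trends) and covariance stability
# (teacher-level random effects) in one model.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("set_ratings.csv")  # hypothetical archive
df["time2"] = df["time"] ** 2        # quadratic time trend

def fit_growth_model(data):
    # Fixed part: overall linear + quadratic trends (mean stability).
    # Random part: intercept + linear slope per teacher (covariance
    # stability), plus a year-within-teacher variance component.
    model = smf.mixedlm(
        "rating ~ time + time2",
        data=data,
        groups=data["teacher"],
        re_formula="~time",
        vc_formula={"year": "0 + C(year)"},
    )
    return model.fit(reml=True)

full = fit_growth_model(df)
print(full.summary())  # near-zero time coefficients indicate mean stability

# Probe the class-size issue: refit using only classes with more than 10
# respondents and compare the teacher intercept variance component.
big = fit_growth_model(df[df["n_students"] > 10])
print("teacher variance, all classes:", full.cov_re.iloc[0, 0])
print("teacher variance, n > 10:    ", big.cov_re.iloc[0, 0])
```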
Another interesting feature of the multilevel growth modeling approach considered here is the consideration of covariates at each of the different levels of analysis: the level of the individual teacher (either constant over time or varying with time) and the level of the individual class. Thus, for example, early career status at the start of the study is a characteristic of teachers that does not vary over time (other examples might include gender, ethnicity, and nationality), whereas other teacher characteristics might be expected to vary over time (e.g., academic rank, research productivity, administrative responsibilities). Here, course level is a covariate that varies at the level of the individual class. Whereas a wide variety of individual class characteristics have been considered in the SET literature as sources of potential bias or validity (e.g., expected grades, class size, workload/difficulty, prior subject interest, mastery of course materials; see Marsh, 2007), few of these studies have taken a multilevel perspective. Although clearly beyond the scope of the present investigation, a multilevel growth modeling approach would allow researchers to explore how these characteristics change over time and to relate those changes to changes in SETs. This opens up a potentially fruitful new direction for SET research that could provide a better understanding of variables typically posited as potential biases to SETs; a sketch of such a specification follows.
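Continuing the hypothetical example above, the sketch below adds covariates at both levels: an early-career indicator (early) as a time-invariant teacher characteristic interacting with the time trends, and a graduate-course indicator (grad) as a class-level covariate that is also allowed to vary randomly across teachers. Again, this is an illustrative statsmodels approximation under assumed column names, not the reported MLwiN analysis.

```python
# Sketch with covariates at two levels (assumed columns as before, plus:
# early = time-invariant early-career indicator at the teacher level;
# grad  = course-level indicator at the individual-class level).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("set_ratings.csv")  # hypothetical archive
df["time2"] = df["time"] ** 2

model = smf.mixedlm(
    # Fixed part: time trends, course level, early-career status, and all
    # of their interactions (patsy's * operator expands the cross terms).
    "rating ~ early * (time + time2) * grad",
    data=df,
    groups=df["teacher"],
    # Random part: course level varies across teachers, so some teachers
    # can be relatively more effective in graduate-level courses.
    re_formula="~time + grad",
    vc_formula={"year": "0 + C(year)"},
)
result = model.fit(reml=True)
print(result.summary())
# The covariance between the random intercept and the random grad slope
# plays the role of the teacher-by-course-level residual covariance
# discussed in the text.
print(result.cov_re)
```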
Another critical feature of the present investigation is the importance of maintaining appropriate longitudinal SET databases. As emphasized here and by many other researchers, there are many important issues in SET research that cannot be addressed adequately with cross-sectional data. From an applied perspective, it is also valuable to maintain longitudinal archives of SETs so that major personnel decisions are based on the best available data. Gilmore and colleagues (1978; Kane et al., 1976) suggested that ratings of a given instructor should be averaged across different courses to enhance generalizability. If it is likely that an instructor will teach many different classes during his or her subsequent career, then tenure decisions should be based on as many different courses as possible. This requires that a longitudinal archive of SETs be maintained for personnel decisions. Such data would provide the basis for more generalizable summaries, the assessment of changes over time, and the determination of which particular courses are best taught by a specific instructor. From a formative perspective, teachers should be given the most useful feedback about their teaching effectiveness. Thus, for example, although there continues to be debate over how to construct normative comparisons, it is clear that maintaining an appropriate longitudinal archive of ratings provides a much richer array of possibilities. Indeed, even for those who argue against the use of norms (see discussion by McKeachie, 1996), the evaluation of systematic change over time in SETs of the same teacher would provide an alternative basis of comparison that does not depend on how ratings of a given teacher compare with those of other teachers. It is most unfortunate that some universities systematically collect SETs but fail to keep a longitudinal archive of the results (Marsh, 2007).

In summary, this longitudinal growth modeling study provides a reasonably clear answer to the question posed in the title. For this diverse cohort of 195 teachers who were evaluated continuously over a 13-year period, there was little evidence that teachers became either more or less effective with added experience. This mean stability generalized reasonably well over undergraduate- and graduate-level courses, teachers who were early in their careers at the start of the study and those who were more experienced, and teachers who differed substantially in terms of their overall teaching effectiveness. Whereas there were substantial individual differences between teachers in terms of their teaching effectiveness, these individual differences were also highly stable over time.

Although highly supportive of the use of SETs for many purposes, these results provide a challenge for the way universities typically use SETs to improve teaching effectiveness (L'Hommedieu et al., 1990; Marsh, 2007; Menges, 1991; Murray, 1997; also see earlier discussion). In particular, the institution of a broadly based program of SETs may arrest the typical decline in evaluations of teaching effectiveness that accompanies additional experience, but there is clear evidence that feedback interventions augmented by external consultation are more effective than SETs alone. In previous research (Marsh, 2007), based on an extensive review of student evaluation research, I concluded that relatively inexpensive, unobtrusive interventions based on SETs that include appropriate assistance from an external consultant can make a substantial difference in teaching effectiveness. This is not surprising, given that university teachers typically receive little or no specialized training in how to be good teachers and apparently do not know how to fully utilize SET feedback without external assistance. I suspect that this regrettable state of affairs actually contributes to the high level of stability of ratings over time. Indeed, I predict that if teachers participate in systematic interventions to improve their teaching effectiveness that have been shown to be effective in other research, their teaching effectiveness will increase with experience. Also, the covariance stability of ratings of individual teachers would, perhaps, be smaller as a consequence of participation in such systematic interventions.

References

Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park, CA: Sage.
Barnes, J. (1985). Experience and student achievement/teacher effectiveness. In T. Husen & T. N. Postlethwaite (Eds.), International encyclopedia of education: Research and studies (pp. 5125–5128). Oxford, England: Pergamon Press.
Bausell, R. B., & Bausell, C. R. (1979). Student ratings and various instructional variables from a within-instructor perspective. Research in Higher Education, 11, 167–177.
Blackburn, R. T., & Lawrence, J. H. (1986). Aging and the quality of faculty job performance. Review of Educational Research, 56(3), 265–290.
Cohen, P. A. (1980). Effectiveness of student-rating feedback for improving college instruction: A meta-analysis. Research in Higher Education, 13, 321–341.
Collins, L. M., & Sayer, A. G. (Eds.). (2001). New methods for the analysis of change. Washington, DC: American Psychological Association.
Duncan, T. E., Duncan, S. C., & Strycker, L. A. (2006). An introduction to latent variable growth curve modeling: Concepts, issues, and applications (2nd ed.). London: Erlbaum.
Feldman, K. A. (1977). Consistency and variability among college students in rating their teachers and courses. Research in Higher Education, 6, 223–274.
Feldman, K. A. (1983). The seniority and instructional experience of college teachers as related to the evaluations they receive from their students. Research in Higher Education, 18, 3–124.
Feldman, K. A. (1997). Identifying exemplary teachers and teaching: Evidence from student ratings. In R. P. Perry & J. C. Smart (Eds.), Effective teaching in higher education: Research and practice (pp. 368–395). New York: Agathon.
Gilmore, G. M., Kane, M. T., & Naccarato, R. W. (1978). The generalizability of student ratings of instruction: Estimates of teacher and course components. Journal of Educational Measurement, 15, 1–13.
Goldstein, H. (2003). Multilevel statistical models (3rd ed.). London: Hodder Arnold.
Horner, K. L., Murray, H. G., & Rushton, J. P. (1989). Relation between aging and rated teaching effectiveness of academic psychologists. Psychology and Aging, 4, 226–229.
Kane, M. T., Gillmore, G. M., & Crooks, T. J. (1976). Student evaluations of teaching: The generalizability of class means. Journal of Educational Measurement, 13, 171–184.
Kulik, J. A., & McKeachie, W. J. (1975). The evaluation of teachers in higher education. Review of Research in Higher Education, 3, 210–240.
L'Hommedieu, R., Menges, R. J., & Brinko, K. T. (1990). Methodological explanations for the modest effects of feedback. Journal of Educational Psychology, 82, 232–241.
Marsh, H. W. (1980). The influence of student, course and instructor characteristics on evaluations of university teaching. American Educational Research Journal, 17, 219–237.
Marsh, H. W. (1981). The use of path analysis to estimate teacher and course effects in student ratings of instructional effectiveness. Applied Psychological Measurement, 6, 47–60.
Marsh, H. W. (1982). SEEQ: A reliable, valid, and useful instrument for collecting students' evaluations of university teaching. British Journal of Educational Psychology, 52, 77–95.
Marsh, H. W. (1984). Students' evaluations of university teaching: Dimensionality, reliability, validity, potential biases, and utility. Journal of Educational Psychology, 76, 707–754.
Marsh, H. W. (1987). Students' evaluations of university teaching: Research findings, methodological issues, and directions for future research. International Journal of Educational Research, 11, 253–388.
Marsh, H. W. (2001). Distinguishing between good (useful) and bad workload on students' evaluations of teaching. American Educational Research Journal, 38(1), 183–212.
Marsh, H. W. (2007). Students' evaluations of university teaching: A multidimensional perspective. In R. P. Perry & J. C. Smart (Eds.), The scholarship of teaching and learning in higher education: An evidence-based perspective (pp. 319–384). New York: Springer.
Marsh, H. W., & Bailey, M. (1993). Multidimensionality of students' evaluations of teaching effectiveness: A profile analysis. Journal of Higher Education, 64, 1–18.
Marsh, H. W., & Dunkin, M. J. (1997). Students' evaluations of university teaching: A multidimensional perspective. In R. P. Perry & J. C. Smart (Eds.), Effective teaching in higher education: Research and practice (pp. 241–320). New York: Agathon.
Marsh, H. W., & Hocevar, D. (1991a). The multidimensionality of students' evaluations of teaching effectiveness: The generality of factor structures across academic discipline, instructor level, and course level. Teaching and Teacher Education, 7, 9–18.
Marsh, H. W., & Hocevar, D. (1991b). Students' evaluations of teaching effectiveness: The stability of mean ratings of the same teachers over a 13-year period. Teaching and Teacher Education, 7, 303–314.
Marsh, H. W., & Overall, J. U. (1979). Long-term stability of students' evaluations. Research in Higher Education, 10, 139–147.
Marsh, H. W., & Roche, L. A. (1993). The use of students' evaluations and an individually structured intervention to enhance university teaching effectiveness. American Educational Research Journal, 30, 217–251.
Marsh, H. W., & Roche, L. A. (1994). The use of students' evaluations of university teaching to improve teaching effectiveness. Canberra, Australia: Australian Department of Employment, Education, and Training.
Marsh, H. W., & Roche, L. A. (1997). Making students' evaluations of teaching effectiveness effective. American Psychologist, 52, 1187–1197.
Marsh, H. W., & Roche, L. A. (2000). Effects of grading leniency and low workloads on students' evaluations of teaching: Popular myth, bias, validity, or innocent bystanders? Journal of Educational Psychology, 92, 202–228.
Marsh, H. W., & Rowe, K. J. (1996). The negative effects of school-average ability on academic self-concept: An application of multilevel modelling. Australian Journal of Education, 40(1), 65–87.
McKeachie, W. J. (1996). Do we need norms of student ratings to evaluate faculty? Instructional Evaluation and Faculty Development, 14, 14–17.
Menges, R. J. (1991). The real world of teaching improvement. In M. Theall & J. Franklin (Eds.), New directions for teaching and learning: No. 48. Effective practices for improving teaching (pp. 21–37). San Francisco: Jossey-Bass.
Murray, H. G. (1990, June). Does evaluation of teaching lead to improvement of teaching? Paper presented at the annual meeting of the Society for Teaching and Learning in Higher Education, Montreal, Canada.
Murray, H. G. (1997). Does evaluation of teaching lead to improvement of teaching? International Journal for Academic Development, 2(1), 8–23.
Nesselroade, J. R., & Baltes, P. B. (Eds.). (1979). Longitudinal research in the study of behavior and development. New York: Academic Press.
Overall, J. U., & Marsh, H. W. (1979). Midterm feedback from students: Its relationship to instructional improvement and students' cognitive and affective outcomes. Journal of Educational Psychology, 71, 856–865.
Overall, J. U., & Marsh, H. W. (1980). Students' evaluations of instruction: A longitudinal study of their stability. Journal of Educational Psychology, 72, 321–325.
Plewis, I. (1985). Analyzing change: Measurement and explanation using longitudinal data. New York: Wiley.
Rasbash, J., Steele, F., Browne, W., & Prosser, B. (2005). A user's guide to MLwiN (Version 2) [Computer software manual]. Bristol, United Kingdom: University of Bristol, Centre for Multilevel Modelling. Available from http://www.cmm.bristol.ac.uk/MLwiN/download/manuals.shtml
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
Renaud, R. D., & Murray, H. G. (1996). Aging, personality, and teaching effectiveness in academic psychologists. Research in Higher Education, 37, 323–340.
Rindermann, H., & Schofield, N. (2001). Generalizability of multidimensional student ratings of university instruction across courses and teachers. Research in Higher Education, 42(4), 377–399.
Rogosa, D., Brandt, D., & Zimowski, M. A. (1982). A growth curve approach to the measurement of change. Psychological Bulletin, 92(3), 726–748.
Rogosa, D. R., Floden, R., & Willett, J. B. (1984). Assessing the stability of teacher behavior. Journal of Educational Psychology, 76, 1000–1027.
Ryans, D. G. (1960). Prediction of teacher effectiveness. In C. W. Harris (Ed.), Encyclopedia of educational research (pp. 1486–1491). New York: Macmillan.
Snijders, T. A. B., & Bosker, R. J. (1999). Multilevel analysis: An introduction to basic and advanced multilevel modeling. London: Sage.
Wen, Z., Marsh, H. W., & Hau, K.-T. (2002). Interaction effects in growth modeling: A full model. Structural Equation Modeling, 9(1), 20–39.
Willett, J. B. (1988). Questions and answers in the measurement of change. In E. Rothkopf (Ed.), Review of research in education (Vol. 15, pp. 345–422). Washington, DC: American Educational Research Association.
Willett, J. B., & Sayer, A. G. (1994). Using covariance structure analysis to detect correlates and predictors of individual change over time. Psychological Bulletin, 116, 363–381.


Appendix
Multilevel Regression Equation for Model 8 (See Model 7 in Table 2)

Outcome variable: zfq31 = overall teacher rating. Three levels: casenum = individual teacher identification number (Level 3, denoted with subscript k); xyear = year (Level 2, denoted with subscript j); individual class (Level 1, denoted with subscript i). Each effect has a set of subscripts indicating the levels at which it varies. Thus, teacher characteristics (e.g., early career status) have only a single subscript k, whereas those that can vary at the individual class level (e.g., course level) have ijk subscripts.

Predictors not allowed to be random at the teacher level: zearly = Early Career (E); zearly.zrcyrlin = E × Time-Linear (T-L); zearly.zrcyrqua = E × Time-Quadratic (T-Q); zearly.zcrslev = E × Course Level (Crse); zearly.zrcyrlin.zcrslev = E × T-L × Crse; zearly.zrcyrqua.zcrslev = E × T-Q × Crse. For each effect there is a parameter estimate and a standard error (in parentheses).

Predictors made random at the teacher level: cons = constant term (random at all three levels); zrcyrlin = T-L; zrcyrqua = T-Q; zcrslev = Crse; zrcyrlin.zcrslev = T-L × Crse; zrcyrqua.zcrslev = T-Q × Crse. For each of these six effects there is an intercept term (the betas, not shown in Table 2) and a variance–covariance matrix of residual variance terms (the gammas, only selected values shown in Table 2).

Residual variance terms: Level 3 (teacher) is represented by a variance–covariance matrix of residual variance terms (the gammas with subscript k, only selected values shown in Table 2). Level 2 (year) is represented by a residual variance term (u with subscripts jk); Level 1 (individual class) is represented by a single residual variance term (e with subscripts ijk).

$$\mathrm{zfq31}_{ijk} \sim N(XB,\ \Omega)$$

$$\begin{aligned}
\mathrm{zfq31}_{ijk} ={}& \beta_{0ijk}\,\mathrm{cons} + \beta_{1k}\,\mathrm{zrcyrlin}_{jk} + \beta_{2k}\,\mathrm{zrcyrqua}_{jk} + \beta_{3k}\,\mathrm{zcrslev}_{ijk} \\
&+ \beta_{4k}\,\mathrm{zrcyrlin.zcrslev}_{ijk} + \beta_{5k}\,\mathrm{zrcyrqua.zcrslev}_{ijk} \\
&+ 0.032\,(0.048)\,\mathrm{zearly}_{k} - 0.009\,(0.019)\,\mathrm{zearly.zrcyrlin}_{jk} \\
&+ 0.007\,(0.017)\,\mathrm{zearly.zrcyrqua}_{jk} - 0.044\,(0.019)\,\mathrm{zearly.zcrslev}_{ijk} \\
&+ 0.009\,(0.020)\,\mathrm{zearly.zrcyrlin.zcrslev}_{ijk} - 0.009\,(0.018)\,\mathrm{zearly.zrcyrqua.zcrslev}_{ijk}
\end{aligned}$$

$$\begin{aligned}
\beta_{0ijk} &= -0.118\,(0.046) + v_{0k} + u_{0jk} + e_{0ijk} \\
\beta_{1k} &= -0.054\,(0.018) + v_{1k} \\
\beta_{2k} &= -0.016\,(0.015) + v_{2k} \\
\beta_{3k} &= -0.169\,(0.024) + v_{3k} \\
\beta_{4k} &= -0.011\,(0.017) + v_{4k} \\
\beta_{5k} &= -0.007\,(0.015) + v_{5k}
\end{aligned}$$

$$\begin{bmatrix} v_{0k} \\ v_{1k} \\ v_{2k} \\ v_{3k} \\ v_{4k} \\ v_{5k} \end{bmatrix} \sim N(0,\ \Omega_v)$$

with the lower triangle of $\Omega_v$ (standard errors in parentheses):

$$\Omega_v =
\begin{bmatrix}
0.384\,(0.038) \\
0.014\,(0.013) & 0.026\,(0.006) \\
-0.009\,(0.011) & 0.000\,(0.004) & 0.013\,(0.004) \\
-0.044\,(0.017) & -0.001\,(0.006) & -0.008\,(0.005) & 0.070\,(0.013) \\
-0.003\,(0.013) & 0.008\,(0.005) & -0.003\,(0.004) & 0.016\,(0.007) & 0.023\,(0.006) \\
-0.002\,(0.011) & -0.006\,(0.004) & -0.003\,(0.004) & 0.002\,(0.005) & -0.009\,(0.005) & 0.013\,(0.004)
\end{bmatrix}$$

$$[u_{0jk}] \sim N(0,\ \Omega_u), \quad \Omega_u = [0.094\,(0.012)]$$

$$[e_{0ijk}] \sim N(0,\ \Omega_e), \quad \Omega_e = [0.380\,(0.019)]$$

$$-2 \times \text{log-likelihood (IGLS deviance)} = 14359.640 \quad (6024 \text{ of } 6024 \text{ cases in use})$$
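For readers who wish to experiment with a comparable specification outside MLwiN, the following hypothetical sketch approximates the structure of Model 8 in Python's statsmodels: the Appendix variable names are reused, the fixed part mirrors the early-career interaction terms, the teacher-level random part includes the six correlated effects, and a variance component stands in for the year level. Estimates from this sketch would not reproduce the published IGLS results exactly, and the data file is assumed.

```python
# Approximate re-expression of Model 8 (sketch only; the published
# estimates were obtained with MLwiN's IGLS algorithm). Hypothetical data
# frame using the Appendix variable names: zfq31, zrcyrlin, zrcyrqua,
# zcrslev, zearly, casenum, xyear.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("set_ratings_model8.csv")  # hypothetical

model = smf.mixedlm(
    # Fixed part: time trends, course level, early-career status, and
    # their interactions (the zearly terms are not random at Level 3).
    "zfq31 ~ zearly * (zrcyrlin + zrcyrqua) * zcrslev",
    data=df,
    groups=df["casenum"],  # Level 3: teacher
    # Level-3 random part: intercept, both time trends, course level, and
    # the time-by-course-level interactions (six correlated effects).
    re_formula="~zrcyrlin + zrcyrqua + zcrslev"
               " + zrcyrlin:zcrslev + zrcyrqua:zcrslev",
    # Level 2: a year-within-teacher variance component.
    vc_formula={"xyear": "0 + C(xyear)"},
)
result = model.fit(reml=False)  # ML fit, loosely comparable to a deviance
print(result.summary())
```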

Received December 12, 2006
Revision received May 31, 2007
Accepted June 12, 2007
