Paying for whose performance? Teacher incentive pay and the black-white achievement gap

Andrew Hill^a (University of South Carolina) and Daniel B. Jones^b (University of South Carolina)

Abstract. How do individual-level teacher incentives impact the black-white achievement gap? We present a simple theoretical model predicting that performance pay may exacerbate the gap, and draw on administrative education data from North Carolina to test this prediction. Using a difference-in-differences approach, our estimates suggest that performance pay considerably increases the conditional black-white gap. This is especially true in classrooms where a majority of students are white, which we interpret as consistent with teachers responding to performance pay by allocating additional effort towards the students they (possibly mistakenly) perceive as high ability in order to increase average class achievement.

___________

^a Hill: University of South Carolina, Darla Moore School of Business, Department of Economics, 1014 Greene St., Columbia, SC 29208. Email: [email protected]. ^b Jones: University of South Carolina, Darla Moore School of Business, Department of Economics, 1014 Greene St., Columbia, SC 29208. Email: [email protected].


1. Introduction

Performance pay for teachers is becoming increasingly common both in the United States and elsewhere. Performance pay in education is often introduced with the goal of minimizing inequity. Congress approved the “Teacher Incentive Fund” (TIF) in 2006 to provide funds to school districts to “develop, implement, or improve comprehensive performance-based compensation systems for teachers and principals, especially for teachers and principals in high-need schools, who raise student academic achievement and close the achievement gap”.[1] Through TIF, from 2006 to 2012, the federal government awarded $1.8 billion in grants to 131 school districts to establish performance-based pay programs in high-need schools. A growing body of literature assesses the impact of such programs on average student achievement (discussed in more detail below), but – despite such large investments from the federal government with the goal of closing the achievement gap – very little is known about the differential impact of performance pay on particular groups of students. With that in mind, we ask: how do individual-level teacher incentives impact the black-white achievement gap?

[1] Emphasis is ours.

We focus on the black-white gap for two reasons. First, it has proven to be one of the most persistent phenomena – and one of the most difficult problems to address through policy – in American education.[2] As recently as 2015, black students’ scores on National Assessment of Educational Progress (NAEP) tests remained between 0.7 and 0.75 standard deviations lower than white students’ scores (depending on the grade level and subject being tested).[3] Existing research has found that the gap cannot be entirely explained by student characteristics correlated with race, nor can it be explained by differences in school and teacher quality (e.g., Fryer & Levitt, 2006). This may explain how the gap has persisted despite billions of dollars each year being funneled into schools with high shares of black students through the Title I program.[4] Performance pay represents a different policy approach, aiming to directly increase the quality of teaching within high-need schools.

[2] See Hedges & Nowell (1998) for an overview of the evolution of the gap since 1965 and Ferguson (1998) for an overview of attempts to minimize the gap through policy.
[3] Based on authors’ calculations of NAEP scores.
[4] As articulated in the 2001 reauthorization of the Elementary and Secondary Education Act, the goal of Title I is: “closing the achievement gap between high- and low-performing children, especially the achievement gaps between minority and nonminority students”.

Second, how teachers shift effort in response to performance pay may depend on teachers’ expectations of students, and recent research finds that teachers may have systematically lower expectations of black students. Burgess & Greaves (2013) draw on administrative data from the UK where they observe both students’ test scores and teachers’ assessments of students’ ability levels. Even with a rich set of controls (including students’ prior test scores, poverty status, etc.), they find that teachers are significantly more likely to underestimate black students – reporting a subjective assessment that is lower than the students’ actual performance on the relevant test.[5]

[5] In a more recent paper, Gershenson et al. (2016) take advantage of a survey in the US which asks two teachers to report their expectations for a single student. Holding student characteristics constant (through student fixed effects), they document that white teachers hold systematically lower expectations of black students than black teachers do; the same is not true of white students.

In Section 2, we present a simple theoretical framework to consider how performance pay impacts student outcomes in an environment where teachers must choose both a level and an allocation of effort across students. Specifically, because performance pay is awarded based on average student achievement, teachers can attempt to increase average achievement either by uniformly improving the scores of all students or by focusing efforts on some subset of the class and pushing for large gains among those students. If teachers indeed have systematically lower expectations of black students (as in Burgess & Greaves (2013)), performance pay may lead teachers to focus additional attention on white students. This, in turn, would exacerbate the black-white achievement gap. Our empirical analysis is aimed at testing this possibility.

To assess the effect of performance pay on the achievement gap, we take advantage of rich administrative data from North Carolina and study the impacts of teacher incentive pay programs introduced by several districts in the state during the 2000s. These programs are typical of programs adopted throughout the US in that they provide individual teachers with bonuses (in some cases up to $12,000 a year) if their students achieve some threshold level of average achievement or growth. Our data include the universe of teachers and students in North Carolina in a panel, allowing us to employ a difference-in-differences approach.

We find that performance pay has no impact on average student achievement. However, when we allow for an interaction between the treatment effect of performance pay and the race of the student, we find evidence in line with our theoretical prediction: performance pay widens the black-white achievement gap. This is true both in simple models with minimal controls and in models with richer controls for characteristics correlated with race that could otherwise explain the result.

In additional analyses, we probe the mechanism driving our main result. First, we consider how our results vary with the racial composition of the class. If teachers are “targeting” effort at white students with the goal of increasing their scores sufficiently to bring average class achievement above the threshold, then there must be enough white students in the class to have a
meaningful impact on the average. If not, teachers may perceive the bonus as too costly to pursue. Indeed, in our data we find that our results are driven by classes where a majority of students are white. In classes where a majority of students are black, we observe no impact of performance pay on either raw levels of achievement or the differential change in achievement for black and white students. Interestingly, the fact that performance pay exacerbates the black-white achievement gap is true both when the teacher is white and when the teacher is black. However, the effect is smaller when the teacher is black, which would be expected if black teachers are less likely to have lower expectations of black students (Gershenson et al., 2016). Relatedly, the bonus must be sufficiently large to make extra effort worthwhile (even if that effort is targeted at a particular group of students). We take advantage of variation in the size of the bonus available across districts (and across time within districts in some cases) to test this empirically, finding that performance pay exacerbates the black-white gap specifically when teachers are eligible for relatively large bonuses.

Our results contribute to a growing literature on performance pay in education, which has mostly focused on average effects of performance pay. Early papers find some evidence of a positive effect of performance pay on student achievement (Eberts et al., 2000; Figlio & Kenny, 2007). Springer et al. (2010) report results from an experiment in the Nashville school system in which some teachers (who volunteered to participate in the experiment) were randomly assigned to participate in individual-level performance pay and other volunteer teachers were randomly assigned to a control group. They find no average effect of performance pay, which is consistent with our finding of no effect of performance pay on average.[6] Other papers in the literature study programs with distinct design features that make them different from the programs studied in this paper, providing mixed evidence on the efficacy of performance pay in education.[7]

[6] We explore this in more detail in a separate paper (Hill & Jones, 2016).
[7] Atkinson et al. (2009) study reforms in the UK that provide periodic permanent pay raises conditional on “substantial and sustained” contributions to the school (as opposed to the one-off bonuses typical of the programs studied here). They find that the value-added of teachers eligible to benefit from the reform increased “by 40% of a grade per pupil,” but they find no evidence that this differentially impacted students at different points on the achievement distribution. Sojourner et al. (2014) study a program in Minnesota that introduced a suite of reforms, including individual performance pay (primarily based on subjective classroom observations) and school-wide performance pay. They conclude that “P4P-centered HRM reform raises students’ achievement by 0.03 standard deviations.” Other papers study purely group-based incentive programs: Goodman & Turner (2013) and Fryer (2013) find no evidence that a group incentive program has an impact on student outcomes. Outside of the US, Lavy (2002) and Glewwe et al. (2003) document positive impacts of group incentive schemes on student outcomes.

Lavy (2009) conducted a randomized experiment in Israeli schools in which teachers participated in a within-school tournament with a monetary prize. The tournament ranking is based on a composite measure of (1) the average score within the class (not conditioned on prior test
scores or “value-added”) and (2) the pass rate within the class; the pass rate received more weight in determining the tournament’s winner. Ultimately, he finds a strong positive effect of the tournament pay scheme. The effects are particularly strong for students who are lower in the ability distribution. Moreover, teachers who were assigned to the tournament condition were more likely to report exerting additional effort on “weak students” in particular. These results suggest a pattern that is different from our empirical findings, but potentially reveal the same underlying mechanism. Teachers exert effort where it has the greatest return; for the policy in Israel, because incentives were tied to pass rates, this was low-achieving students at risk of failing. A similar incentive is completely absent from the programs introduced in North Carolina.

Outside of teacher performance pay, it has been argued that No Child Left Behind (NCLB) school accountability reforms targeting pass rates have similar effects. Reback (2008) reports that low-achieving students disproportionately benefit from NCLB-related policy changes, while Neal and Schanzenbach (2010) suggest that these changes specifically induce teachers to shift attention to students who are near the current proficiency standard (and not too low or too high in the achievement distribution). In short, while performance pay has a clear positive impact outside of teaching,[8] its effects are less well understood when introduced in schools. We contribute to this literature by explicitly considering the possibility that – by rewarding teachers based on the average of multiple students’ scores – performance pay may have heterogeneous effects on different groups of students who contribute to the average. This could explain the mixed results in the literature: if performance pay has different effects on different types of students, then the composition of students within the class may dictate whether a positive average effect is detected or not.

[8] See, for instance: Lazear, 2000; Paarsch and Shearer, 2000; Shearer, 2004; Bandiera et al., 2007.

2. Simple theoretical framework and hypotheses

In this section, we describe a theoretical framework that frames the hypotheses we test in our empirical analysis. We begin by discussing a general framework that draws on Holmstrom & Milgrom (1991). Teacher incentives are simplified to focus on the salient elements of teacher performance pay in our context. We then discuss the interpretation of that framework in order to draw competing hypotheses to test empirically; we focus in particular on how the framework may relate to existing findings suggesting that teachers may hold systematically different expectations for students of different race groups.
2.1 Theoretical framework

We consider an environment where teachers choose a level of effort $t_i$ for each student $i$ in their class.[9] For each student, the teacher has expectations about how this effort is converted into gains (relative to pre-existing test scores), represented by some function $f_i(t_i)$. The marginal returns from effort are positive but decreasing for all students, $f_i'(t_i) > 0$ and $f_i''(t_i) < 0$. We assume that $f_i(t_i)$ is student-specific, allowing the teacher to have different expectations about how effort is translated into achievement gains for different students. For simplicity, we assume that there are just two types of students: “High” marginal gain students (denoted by a subscript “H”) and “Low” marginal gain students (denoted by a subscript “L”), where $f_H(t) > f_L(t)$ and $f_H'(t) > f_L'(t)$. Figure 1 depicts the relationship between teacher effort and (expected) gain for each of these groups.

[9] A second issue worth discussing is what it means to allocate effort differently across students. An example of how this might occur is teachers providing more or better feedback to some students. There is some evidence that this happens: in an experiment, Taylor (1979) finds that student teachers provide briefer feedback in response to mistakes and less positive feedback in response to correct answers when they think the student is black.

The teacher’s effort cost is a function of the total sum of effort provided to the class, $c(T)$, where $T = \sum_i t_i$ and $c'(T) > 0$. Prior to the introduction of performance pay, our starting point is a simple employee objective function for teachers: in order to remain employed and receive wage $w$, the teacher’s average class improvement $\bar{X} = \frac{1}{n}\sum_i f_i(t_i)$ must be at least $X$. The optimal vector of effort $t$ will minimize cost $c(\sum_i t_i)$ while exactly meeting the minimum average class improvement requirement, so $\frac{1}{n}\sum_i f_i(t_i) = X$. Assuming an interior solution,[10] solving the teacher’s cost minimization problem yields first order conditions $f_1'(t_1) = \cdots = f_n'(t_n)$. The teacher simply chooses effort to equate the marginal returns from effort across her students. The cost associated with this effort level is denoted by $c(\sum_i t_i) = \bar{c}$. The solid blue (double arrow) lines in Figure 1 illustrate the optimal tangency conditions for the “High” and “Low” marginal return students. Notably, because $f_H'(t) > f_L'(t)$ for any $t$, in order to equalize the marginal returns to effort, $t_H$ exceeds $t_L$. That is, students with higher marginal returns to effort receive more attention even in the absence of performance pay.

[10] A corner solution may exist in which the teacher chooses to allocate zero effort to some students. This will be the case when the marginal returns from any positive effort for these students are sufficiently small.

Performance pay is introduced into the school in the form of a cash bonus $B$ that is paid to the teacher if her average class improvement exceeds some threshold $\hat{X}$, where $X < \hat{X}$. Her objective is now to choose effort to maximize the utility function:

$$u = \begin{cases} w + B - c(T) & \text{if } \bar{X} \geq \hat{X} \\ w - c(T) & \text{if } X \leq \bar{X} < \hat{X} \end{cases}$$

If the bonus is too small or the effort required to achieve the bonus is too high (perhaps because a teacher’s class is populated with a large share of “Low” marginal return types), then the teacher will not adjust her behavior and the predicted effort allocations will remain the same as in the world without performance pay. If the bonus exceeds the cost of the additional effort required to obtain the bonus, then the teacher will increase her effort to achieve the bonus. She will choose the vector of effort $t^*$ that exactly meets the bonus requirement, $\frac{1}{n}\sum_i f_i(t_i^*) = \hat{X}$, with similar first order conditions on the marginal returns from effort across her students, $f_1'(t_1^*) = \cdots = f_n'(t_n^*)$. Given the assumptions that $f_i'(t_i) > 0$ and $f_i''(t_i) < 0$ for all students, the teacher will necessarily increase effort for all her students, so $t_i^* > t_i$, implying $f_i(t_i^*) > f_i(t_i)$ and $f_i'(t_i^*) < f_i'(t_i)$.[11] All students receive more effort and show more academic improvement when performance pay is introduced. However, because the teacher equalizes the marginal returns to effort across students, students with higher marginal returns receive more additional attention, illustrated by the dotted red (triple arrow) tangency conditions in Figure 1. Note that while the teacher increases the effort directed at both “High” and “Low” students relative to the pre-performance pay regime ($t_H^* > t_H$ and $t_L^* > t_L$), the increase in effort directed towards “High” students (and the corresponding increase in growth) is larger than it is for “Low” students ($t_H^* - t_H > t_L^* - t_L$ and $f_H(t_H^*) - f_H(t_H) > f_L(t_L^*) - f_L(t_L)$). In other words, while performance pay may improve outcomes for both groups of students, the students perceived to be “High” types disproportionately benefit.

[11] These inequalities may be weak for some students if there is a corner solution.
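To make the comparative static concrete, the following is a minimal numerical sketch of the framework under functional forms we choose purely for illustration ($f_H(t) = 2\sqrt{t}$, $f_L(t) = \sqrt{t}$, and linear cost $c(T) = T$); these specific forms are our assumption and not part of the model itself.

```python
# Minimal numerical sketch of the effort-allocation problem above, under
# illustrative (assumed) functional forms: f_H(t) = 2*sqrt(t), f_L(t) = sqrt(t),
# and linear cost c(T) = T, with one "High" and one "Low" student.
import numpy as np
from scipy.optimize import minimize

def solve_effort(target):
    """Minimize total effort subject to the average class gain equaling `target`."""
    avg_gain = lambda t: (2.0 * np.sqrt(t[0]) + np.sqrt(t[1])) / 2.0
    res = minimize(
        lambda t: t[0] + t[1],                      # cost c(T) = T
        x0=[0.5, 0.5],
        method="SLSQP",
        bounds=[(1e-9, None), (1e-9, None)],
        constraints=[{"type": "eq", "fun": lambda t: avg_gain(t) - target}],
    )
    return res.x                                    # (t_H, t_L)

t_H, t_L = solve_effort(target=1.0)                 # employment requirement X
t_H_star, t_L_star = solve_effort(target=1.5)       # bonus threshold X_hat > X

# At each optimum, marginal returns are equalized (1/sqrt(t_H) = 0.5/sqrt(t_L)),
# so t_H = 4 * t_L; raising the target raises t_H by more than t_L, matching the
# prediction that "High" types receive a disproportionate share of extra effort.
print(t_H, t_L)                                     # ~0.64, ~0.16
print(t_H_star - t_H, t_L_star - t_L)               # ~0.80 > ~0.20
```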

2.2 Interpretation of theoretical framework and resulting hypotheses

This theoretical framework illustrates how performance bonuses tied to average improvement may lead to differential benefits for different groups of students. To relate this framework back to the empirical question of interest (performance pay and the black-white achievement gap), we must consider which group would be perceived to be “High” types and which group would be perceived to be “Low” types. Burgess and Greaves (2013) and Gershenson et al.

(2016) find that teachers have higher expectations of white students (even after conditioning on observables). Whether this implies that black students or white students will disproportionately benefit from performance pay is theoretically ambiguous and is therefore an empirical question.

On the one hand, if teachers have higher expectations of white students than black students (even if inaccurate), then they may expect that black students have more room to improve and therefore generate higher marginal returns to effort. In this case, performance pay could actually reduce the size of the black-white gap. Moreover, as teachers (in our framework) only find it worthwhile to increase effort if the cost of reaching the bonus is not too high, under the assumption that black students are “High” marginal return types, we would expect gains to occur especially in classrooms with a high share of black students.

On the other hand, students with higher underlying ability may come into the course more prepared, requiring less review of previous material and able to move on to material for the current end-of-course exam more quickly; one unit of effort generates a larger gain on the relevant test for these students. Thus, if teachers have higher expectations of white students than black students (even if inaccurate), it may be white students who are perceived to be the “High” types. If this interpretation is correct, performance pay would actually exacerbate the black-white gap. By the same reasoning as above, in this case, it is more likely that we would observe some impact of performance pay in classrooms with a higher share of white students.

Based on existing empirical work and patterns in our data, we anticipate that the second hypothesis – that performance pay exacerbates the black-white gap – is more likely. Ladd & Walsh (2000) document a positive correlation between school-level value-added and previous test scores, despite the fact that value-added is intended to difference out previous test scores and account for differential initial endowments of knowledge across students. If the same is true at the student level, then this implies that students with a higher base level of knowledge from previous years are better positioned to improve. Second, as we will show in a later section, we find that there is not just a black-white gap in test score levels (even conditioning on observables), but there is also a black-white gap in test score growth. We return to this point later in the paper.

3. Individual-level teacher incentives in North Carolina school districts

The difference-in-differences strategy that we adopt exploits the introduction of performance pay in three school districts in North Carolina: Charlotte-Mecklenburg (program started in academic year 2007-2008), Guilford (program started in 2006-2007), and Cumberland (program started in
2007-2008), all of which received funding from the Teacher Incentive Fund.[12] The Teacher Incentive Fund requires interventions to be targeted towards high need schools, where “high need” is typically defined on the basis of percentage of students eligible for free lunch, historical teacher turnover, and academic performance (as measured by state-designed end-of-grade and end-of-course tests).[13] Only a subset of high need schools (34 in total) is treated: 10 schools in Charlotte-Mecklenburg, 5 schools in Cumberland, and 19 schools in Guilford implemented performance pay.[14]

Teacher performance pay in all of these programs is based on student performance on standardized tests; specifically, on the basis of value added computed from these test scores. North Carolina employs the widely-used SAS EVAAS methodology to determine teachers’ value added scores. Typically, value added measures the average difference between students’ scores on “end-of-grade” standardized tests and their scores on similar tests taken the year before. In high school (which is the level of education we focus on), courses are not as clearly sequential as in earlier grades. Consequently, “value added” in North Carolina high schools is measured as the difference between a student’s actual end-of-course (EOC) test score and a predicted EOC test score. We elaborate on the construction of the “predicted EOC score” in the data section, but, in short, the North Carolina Department of Public Instruction uses previous cohorts of students to estimate the average predictive impact of Grade 8 math and reading end-of-grade (EOG) scores on the relevant high school EOC scores, and then uses these models to generate a predicted score for each current student based on their Grade 8 math and reading EOG test scores. The EOC tests are designed by the state and are the same for all students across districts in a given year. Not all courses have EOC tests; during the time period we study, high school students consistently take EOCs in Algebra I, Algebra II, Biology, and English I.[15] Performance pay bonuses are therefore only available for high school teachers who teach one of these four subjects.

[12] In total, four school districts in North Carolina adopted performance pay programs in the time period we study. However, in one of those districts (Winston-Salem/Forsyth), the reforms impacted only elementary and middle schools. We focus on high schools in this paper, so high schools in that district are part of our “control” group.
[13] This information came from discussions with the director of the performance pay initiatives in Guilford County Schools. Also note that the “end-of-grade” tests are subject-specific tests taken at the end of grades 3-8, whereas “end-of-course” tests are subject-specific tests taken at the end of courses in grades 9-12.
[14] Note that in Guilford and Charlotte-Mecklenburg the program started with a smaller number of schools and eventually expanded to the number of schools listed here. In our difference-in-differences analysis, we define treatment at the school level; thus, even though performance pay is present in some schools in Guilford County in 2006-2007, a school that is not treated until 2008-2009 is not coded as treated until then.
[15] Mansfield (2015) notes that some other courses also have EOC tests, but as these are not consistently administered (or at least consistently available in the data), they are excluded from the analysis.
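As a schematic illustration only (this is not the SAS EVAAS implementation), the value-added measure described above can be thought of as the teacher-level mean of actual-minus-predicted scores; the threshold rule sketched here loosely mimics the bonus criterion detailed in the next paragraph, and all column names are hypothetical.

```python
# Schematic sketch of the value-added measure described above; NOT the SAS
# EVAAS implementation. All column names are hypothetical.
import pandas as pd

def teacher_value_added(df: pd.DataFrame) -> pd.Series:
    """df has one row per student-course, with columns
    'teacher_id', 'eoc_score' (actual), and 'pred_eoc' (predicted)."""
    gain = df["eoc_score"] - df["pred_eoc"]
    return gain.groupby(df["teacher_id"]).mean()

def bonus_eligible(va: pd.Series, se_multiple: float = 1.0) -> pd.Series:
    """Stylized version of a threshold rule of the kind described below: a
    bonus is paid if value-added is at least `se_multiple` standard errors
    above the district mean (here proxied by the standard error of the
    district-wide value-added distribution)."""
    return va >= va.mean() + se_multiple * va.sem()
```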

Bonuses are paid to teachers who achieve some threshold level of value-added. Specific parameters (such as the size of incentives and the threshold level of value-added necessary to qualify) differ across the three districts, but in all cases this general incentive structure is the same. For instance, most teachers in Guilford County receive $2,000 if their value-added is “1 standard error above the district mean” and $6,000 if value-added is “1.5 standard errors above the mean.” In Charlotte-Mecklenburg, teachers were awarded a bonus if their value-added measure was in the top 30% in the district. In some analyses, we take advantage of variation (both across and within districts) in the maximum bonus available to teachers. The maximum possible bonus is $6,000 for non-math teachers and $12,000 for math teachers in Guilford County schools and $4,000 in Cumberland County schools; in Charlotte-Mecklenburg schools, the maximum size of the bonus shifted over time, starting at $2,800, increasing to $5,300 in 2010, and (slightly) decreasing to $5,000 in years thereafter.

In the policy-affected schools in all three districts, recruitment incentives were introduced at the same time that performance pay was introduced. These incentives provided one-time bonuses to highly qualified new teachers filling hard-to-staff positions. The sorting of teachers into schools resulting from the recruitment incentives cannot be distinguished from the sorting that would have occurred purely in response to the performance pay programs (as teachers who prefer performance pay move into treated schools and teachers who do not move out of them). The overall effect of the set of policies is identified by looking at the overall change in students’ EOC test scores within a given school before and after the policy change, while we identify the incentive effects of performance pay from teachers who were present in the treated schools before and after the introduction of performance pay. We discuss this in more detail in the methodology section.

4. Data

We use data on the population of students and teachers in public North Carolina high schools between 1997 and 2013, obtained from the North Carolina Education Research Data Center (NCERDC). The data have two key features that make them well-suited for studying individual teacher performance pay: first, we observe information that allows us to match students to their teachers, and, second, we observe the student EOC test scores used for evaluating teacher performance. The data also include rich student and teacher demographics, as well as school and district characteristics.


As discussed above, high school students take EOC tests in Algebra I, Algebra II, Biology, and English I during our study period.[16] The analysis is therefore restricted to students we observe taking these subjects (“EOC students”). To ensure that treatment and control groups in our difference-in-differences design are comparable, the estimation sample is restricted to students from the types of schools that would be eligible for the studied performance pay programs in North Carolina. These are schools that are low-achieving (administratively defined by the share of students at grade level), schools that have a high teacher turnover rate (in the upper quintile), and schools that have a large share of students on free lunch (in the upper quintile).[17] Given the focus of our analysis on the black-white achievement gap, we also restrict the sample to black and white students, dropping the relatively small share of students in North Carolina who are Hispanic, Asian, or Native American (roughly 10%).

[16] Jackson (2014) restricts analysis to Algebra I and English I in his investigation of tracking using the same data, reasoning that these two courses have the most consistently administered EOC tests. We report results in which we both pool the four EOC courses together and consider the individual courses separately, showing that restricting the sample to Algebra I and English I would not affect the pattern of results.
[17] While we do not have specific information on the thresholds used within each district to determine which schools receive performance pay, we know from conversations with the administrator of the performance pay program in one of the three districts that these three criteria (share of free-lunch eligible students, rate of teacher turnover, and school-level performance on state standardized tests) were used to select schools. To establish cutoffs in these criteria in order to identify control group schools in the rest of the state, we estimated linear probability models at the school level in treated districts, estimating the likelihood that a school is treated as a function of these three criteria. The models revealed that schools coded in our data as “low-achieving” and schools in the upper quintiles of teacher turnover and share of free-lunch eligible students were dramatically more likely to receive treatment than other schools. Thus, we use these as our cutoffs in establishing our control group. We note, however, that our results are robust to simply using all schools in the state as a control group.

Recall from the previous section that value-added for high school teachers in North Carolina is calculated as the average difference between students’ actual EOC scores and a predicted EOC score. To the extent possible, we construct predicted EOC scores following the procedure used by North Carolina schools. Specifically, we predict test scores for Algebra I, English I, and Biology based on normalized 8th grade end-of-grade Reading and Mathematics test scores (adding Algebra I test scores to the model for predicting Algebra II test scores). The cohort of students writing EOC tests in 2006 is used to estimate the parameters of the predictive model, as this is the final year before any performance pay programs are introduced in North Carolina; student $i$’s test score in teacher $j$’s classroom in course $c$ in 2006 is a linear function of his or her 8th grade Reading and Mathematics scores:[18]

$$y_{ijc,2006} = \alpha_c + \beta_{1c}\,Read_{i,2006} + \beta_{2c}\,Math_{i,2006} + \varepsilon_{ijc,2006}$$

Using the estimated parameters $\hat{\alpha}_c$, $\hat{\beta}_{1c}$, and $\hat{\beta}_{2c}$ from the above regression, we obtain predicted scores $\hat{y}$ in course $c$ for students writing EOC tests in other years:

$$\hat{y}_{ict} = \hat{\alpha}_c + \hat{\beta}_{1c}\,Read_{it} + \hat{\beta}_{2c}\,Math_{it}$$

[18] Models in which Reading and Mathematics test scores enter quadratically are also considered, but this barely affects the predicted score.
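The construction just described might be sketched as follows, assuming a long-format student-course DataFrame with hypothetical column names; the state’s actual procedure may differ in detail.

```python
# Sketch of the predicted-score construction above; column names are
# hypothetical and the state's actual procedure may differ in detail.
import pandas as pd
import statsmodels.formula.api as smf

def fit_prediction_models(df: pd.DataFrame) -> dict:
    """Estimate course-specific prediction models on the 2006 cohort only."""
    models = {}
    for course, grp in df[df["year"] == 2006].groupby("course"):
        formula = "eoc_score ~ read_g8 + math_g8"
        if course == "Algebra II":
            formula += " + algebra1_score"      # as described in the text
        models[course] = smf.ols(formula, data=grp).fit()
    return models

def add_predicted_scores(df: pd.DataFrame, models: dict) -> pd.DataFrame:
    """Apply the 2006-cohort coefficients to student-courses in all years."""
    df = df.copy()
    df["pred_eoc"] = float("nan")
    for course, model in models.items():
        mask = df["course"] == course
        df.loc[mask, "pred_eoc"] = model.predict(df.loc[mask])
    return df
```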

We use this predicted score as a control in our student-level analysis. Given that we take the actual score as the outcome measure, we are implicitly assessing the impact of treatment on student gains, and, more specifically, the same student gains used by North Carolina schools to calculate teacher value added. We do note, however, that alternate specifications taking variations on lagged scores and predicted scores as controls yield very similar results, which are discussed in more detail in a later section.

It is worth noting that although teachers are likely to know how a student performed on previous end-of-grade tests, they almost certainly do not know a student’s exact predicted EOC test score or the model used to compute it. The ability of teachers to accurately identify students who underperformed relative to potential in previous tests is therefore limited. Teachers may therefore place more emphasis on overall ability (rather than the ability gradient), and, most importantly for our analysis, may use other student characteristics to decide where to exert effort to maximize their value added.

The set of high need schools similar to those eligible for treatment – the estimation sample – contains 280,115 unique high school students with EOC test scores in an average of 2.15 courses during our study period. Summary statistics describing the average characteristics of both the full sample and the estimation sample are reported in Table 1. Most notably, students in the estimation sample of high need schools are considerably more likely to be black than the average high school student in North Carolina (60.4 percent in Column 2 in comparison to 26.1 percent in Column 1). On average, they also have less educated parents and are more likely to live in census block groups with higher poverty rates.[19]

[19] Parent education is provided in the end-of-grade and end-of-course files. We take the highest parent education ever recorded for the student. Census block group poverty rates were merged into the data using census block group identifiers available in the NCERDC data.

As expected, the mean
EOC test score for students in high need schools is lower than the mean EOC test score for students in North Carolina (29.1 in comparison to 32.0 on a scale of 0 to 70).

Within the estimation sample, there are striking differences in the characteristics of white students and black students despite being in the same set of high need schools. On average, black students in these schools have parents with much lower levels of education and are considerably more likely to live in poorer census block groups. The biggest difference, however, is in the mean EOC test score, which is 32.7 for white students and 26.8 for black students. The black-white achievement gap is stark even within the subset of high need schools analyzed in this paper. Given that the standard deviation in EOC test scores is 9.0, the unconditional black-white achievement gap in our data is about 0.63 standard deviations, which is consistent with much of the established literature and the gap in NAEP scores we reported in the introduction.[20]

[20] This is consistent with Fryer & Levitt’s (2006) finding that the gap persists even when accounting for school fixed effects, suggesting that a large part of the gap occurs within schools and not entirely across high- and low-achieving schools.

In a later section of the paper, we estimate the size of the black-white gap in our sample conditional on the covariates used in our main analysis. This gap is only partly attenuated when we look within classes (students taking the same course with the same teacher in the same year): 57.4 percent of white students are in the top half of the class, while the same is true for only 41.7 percent of black students. Overall, these differences highlight that black-white achievement gaps can only be partly explained by the types of schools black and white students attend, providing further impetus for considering whether there may be potential differences in within-school inputs for black and white students.

The final element of the data is the matching of students to teachers. Constructing student-teacher matches is necessary because the teacher identifier in the student-level EOC test score file reflects the teacher who administered the EOC test rather than the teacher who taught the course (although these may be the same in some circumstances). Our algorithm for matching students and teachers draws heavily from Jackson (2014) and Mansfield (2015), two papers that use the same North Carolina education data that we do.[21]

[21] We are grateful to both authors for making their matching code available through the replication archives of the journal in which these papers were published. The replication code from Rothstein (2010) was also very useful.

The existence of a separate classroom-level (“activity”-level) file containing the true teacher identifier and a set of classroom characteristics for every course taught by every teacher allows us to generate matches; students (and their individual EOC test scores) can be linked to teachers by matching reported classroom demographic composition from
the classroom-level data with constructed classroom demographic composition from the student-level data. We discuss the full details of our matching procedure in an appendix. We note, however, that our algorithm leads to matches of a similar quality to Jackson (2014) and Mansfield (2015).

5. Methodology for Estimating Effect of Performance Pay Programs

Our main analysis centers on a difference-in-differences estimation of the impact of performance pay on students’ EOC test scores. With student-courses as the unit of observation and pooling all EOC students, a simple representation of our estimating equation can be written as:

$$Y_{ijsct} = \theta\,Treated_{st} + \gamma_{ct} + \delta_s + (\delta_s \times t) + \rho\,PredY_{ijsct} + \pi X_{ijsct} + \varepsilon_{ijsct}$$

$Y_{ijsct}$ is the EOC test score of student $i$ with teacher $j$ in school $s$ in subject $c$ during year $t$. In all analyses that follow, the EOC test score is standard normalized (mean = 0, standard deviation = 1) so that we can interpret the magnitude of the treatment effect in units of standard deviations. $Treated_{st}$ indicates whether an individual-level performance pay program was operational in school $s$ during year $t$, $\gamma_{ct}$ are subject-by-year fixed effects, $\delta_s$ are school fixed effects, $(\delta_s \times t)$ are linear school trends, $PredY_{ijsct}$ is the predicted EOC test score of student $i$ (used for the calculation of value added; this measure is also standard normalized), and $X_{ijsct}$ is a vector of other student characteristics including gender, race, and age. The parameter of interest is $\theta$, the effect of performance pay on EOC test scores.

The subject-by-year fixed effects capture any time-varying subject-specific shocks, such as changes in the difficulty of particular EOC tests. It is also important to consider achievement gaps within courses given the differences in average course-taking behavior between white and black students observed in Table 1; it may be the case that white students take systematically easier or more difficult courses. The school fixed effects control for any systematic differences between schools that implement performance pay and those that do not. With school fixed effects, the parameter of interest $\theta$ is the average gain in students’ EOC test scores in a given school before and after treatment. This captures both sorting effects (from either the recruitment bonuses or teachers moving in or out of schools due to the performance pay programs) and incentive effects (changes in teachers’ behavior in response to performance pay). This would provide an overall (albeit partial equilibrium) evaluation of the set of policies introduced in the districts.
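As a minimal sketch (with hypothetical variable names), the specification above, together with the race interaction introduced below, could be estimated along the following lines. The clustering level is our assumption, and with the full North Carolina panel a dedicated high-dimensional fixed-effects estimator would be preferable to explicit dummies.

```python
# Minimal sketch of the estimating equation above (and the treated x black
# interaction introduced below). Variable names are hypothetical; school-level
# clustering is our assumption, and with this many fixed-effect dummies a
# dedicated high-dimensional-FE estimator would be preferable in practice.
import statsmodels.formula.api as smf

# Standard-normalize the outcome and the predicted score, as in the text.
df["y_std"] = (df["eoc_score"] - df["eoc_score"].mean()) / df["eoc_score"].std()
df["pred_std"] = (df["pred_eoc"] - df["pred_eoc"].mean()) / df["pred_eoc"].std()
df["subj_year"] = df["subject"].astype(str) + "_" + df["year"].astype(str)

fit = smf.ols(
    "y_std ~ treated * black + pred_std + female + age"
    " + C(subj_year) + C(school)",     # gamma_ct and delta_s
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["school"]})

print(fit.params["treated"])           # theta: effect for white students
print(fit.params["treated:black"])     # sigma: effect on the black-white gap
print(fit.t_test("treated + treated:black = 0"))  # theta + sigma, with its SE
```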


Our focus, however, is on the performance pay element of the interventions. The primary model we consider therefore replaces school fixed effects $\delta_s$ with school-by-teacher fixed effects $\delta_{js}$, which eliminate the influence of recruitment incentives. The estimated treatment effects in this specification come from students of teachers who were in the same school before and after the introduction of performance pay. At the same time, of course, teacher-by-school fixed effects eliminate any impact from the sorting component of performance pay. This is a concern if we think that teacher sorting in response to performance pay affects the black-white achievement gap. Ultimately, we show that the effect of performance pay on the achievement gap is similar in the two specifications (school fixed effects in comparison to teacher-by-school fixed effects), suggesting that the findings are driven by incentive effects rather than sorting effects or teacher composition.

Finally, the model also includes predicted EOC test scores. Specifically, this controls for any differences between treated and non-treated students in performance on past standardized tests, although loosely we may think of this as controlling for student ability. Estimated effects may therefore be interpreted as gains, which aligns with the measure of value added that is actually incentivized by the performance pay programs.[22]

[22] One could argue that predicted score should be measured differently for black students and white students. For instance, the relationship between prior test scores and present test scores may be different for black students due to differential degrees of “summer slide”. To account for this, we have estimated models where “predicted EOC score” is fully interacted with “black” on the right-hand side of our main estimating equation. Results are very similar to the results presented in this paper. In fact, our main result (an expansion of the black-white gap in response to performance pay) is more pronounced in these alternative specifications.

So far, the model is set up to explore the overall effect of individual performance pay programs on student EOC test scores. The primary focus of this paper is to assess the impact of performance pay programs on the black-white achievement gap. We investigate this by interacting the performance pay indicator in the above model with an indicator for whether the student is black:

$$Y_{ijsct} = \theta\,Tr_{st} + \sigma(Tr_{st} \times Black_{ijsct}) + \gamma_{ct} + \delta_{js} + (\delta_s \times t) + \rho\,PredY_{ijsct} + \pi X_{ijsct} + \varepsilon_{ijsct}$$

The coefficient on the performance pay indicator, $\theta$, reflects the effect for white students, the coefficient on the interaction, $\sigma$, reflects the difference in the effect between black and white students, and the sum of the coefficients, $\theta + \sigma$, reflects the effect for black students. We are primarily interested in whether performance pay attenuates or exacerbates the black-white achievement gap, which is dictated by whether $\sigma$ is positive or negative.

In a series of subsequent specifications, we probe how performance pay affects the black-white gap among various student subgroups, such as students with better or worse educated parents

or living in census blocks with above or below average poverty rates. This is to test whether characteristics that may be correlated with race, rather than race itself, may explain some of the patterns we observe. The results we report are from specifications with additional interactions to capture subgroup-specific effects, although splitting the sample provided similar estimates.

6. Results

6.1 Preliminaries: Estimating the black-white gap in our data

Before proceeding to the main results, we first provide a fuller sense of the size of the black-white gap in our sample of high need schools, conditioning on the same set of covariates we include in our main analyses. Our main goal in doing so is both to (1) provide a sense of the sources of the gap and (2) estimate the magnitude of the conditional black-white gap in our sample, which can then be compared to our estimated effect of treatment on the gap. We simply regress students’ (standard normalized) EOC test scores on an indicator variable equal to one if a student is black and a sequentially growing set of covariates. The “black” indicator identifies the size of the gap.

Results are reported in Table 2. Column 1 is essentially the parallel to the simple unconditional black-white gap discussed in the previous section. There, other than the “black” indicator, we include only year-by-subject fixed effects and find that the gap is roughly 0.6 standard deviations (similar to our mean comparison in the data section). Including school fixed effects (Column 2) or teacher-by-school fixed effects (Column 3) reduces the size of the gap, but not dramatically. The size of the gap is substantially reduced once we include controls for demographics and (importantly) predicted EOC scores (Columns 4 and 5); conditioning on prior achievement, gender, age, and grade level, we observe a conditional black-white gap of roughly 0.06 standard deviations. This is not substantially altered by further including controls for neighborhood poverty rate and parental education level, controls that are not available for the full sample (Columns 6 and 7).[23]

[23] In Appendix Table 1, we repeat these same specifications on a sample consisting of just schools that are never treated throughout the entire panel (Panel A) and a sample consisting of all schools prior to 2007, the earliest treatment year (Panel B).

It is worth noting that with the inclusion of predicted scores on the right-hand side of the regressions reported in Table 2 (Columns 4-7), we are essentially estimating a conditional black-white gap in test score gains, or a black-white value-added gap.[24] That is, black students not only have lower test scores, but also – at least in our data – experience smaller year-to-year gains.

[24] We thank an anonymous referee for emphasizing this point.

If, as would
be true under a model of statistical discrimination, teachers expect this average value-added gap to be true of the average black student, then – referring back to our simple theoretical framework – teachers may expect black students to generate lower marginal returns to effort, and therefore may differentially increase effort directed towards white students once performance pay is introduced. Having established the size of the black-white gap conditional on the same covariates used in the main analysis, we now turn to the effect of performance pay.

6.2 Main results

Table 3 reports the main results of the paper: estimates of $\theta$ (the impact of performance pay on white students), $\sigma$ (the differential impact of performance pay on black students, and therefore on the black-white gap), and $\theta + \sigma$ (the impact of performance pay on black students). The first three columns include school fixed effects, providing estimates of the overall effect of reforms on student achievement (driven both by changes in the composition of teachers in treated schools and changes in the behavior of teachers in response to performance pay), while the second three columns include teacher-by-school fixed effects, isolating the incentive effects of the performance pay programs.

Consistent with recent evaluations of teacher performance pay discussed in the introduction (e.g., Springer et al., 2010), the first column indicates that there is no overall impact on student achievement from the set of policy interventions. However, Column 2 shows that the overall effect masks heterogeneity in the effect by student race. White students’ scores increase by 0.042 standard deviations, while the black-white gap expands by 0.052 standard deviations. (This in turn implies an effect on black students of -0.010, which is not statistically different from zero.) Furthermore, controlling for the poverty rate of the census block in which the student lives and parent education only marginally reduces the magnitude of the effect on the black-white gap (-0.047 in Column 3). One reason these additional controls do not have a large impact in our analysis may be that our estimation sample is restricted to students in high need schools, which would mitigate some of the differences in poverty rates and parental education between average white and black students in North Carolina.

The results in Columns 4 to 6 show that the effect on the achievement gap is unchanged when we eliminate effects operating through teacher composition. With teacher-by-school fixed effects (using identifying variation from students of teachers who were in the same school before and after treatment), the black-white achievement gap grows by between 0.045 and 0.056 standard deviations when performance pay is introduced. Given that our estimate of the black-white gap
conditional on the same covariates employed in Columns 5 and 6 was roughly 0.06 standard deviations, our estimates suggest that performance pay nearly doubles the size of the gap. Notably, the results in Columns 1 through 3 and Columns 4 through 6 are very similar, suggesting that most of the effect of performance pay is coming through changes in the behavior of teachers rather than a change in the composition of teachers. For this reason (and to strip away the impact of sorting caused by recruitment incentives), we employ teacher-by-school fixed effects in all remaining results.

In the main results presented so far, we control for student ability using the “predicted score” measure similar to the one constructed in North Carolina for the purposes of calculating value-added. This is done so that we can assess the impact of performance pay on the incentivized measure (the gain in student scores relative to their predicted score). However, our results are not dependent on this particular measure of prior achievement. In Appendix Table 2, we replicate our most robust specification (Column 3, Table 3) using 16 variations on measures of prior achievement. In addition to predicted scores constructed using math and reading scores from 8th grade in Panel A (the original measure), we construct a second predicted score using all scores from grades 6 through 8 in Panel B, and, in Panels C and D, respectively, we directly control for the vectors of lagged 8th grade and 6th-8th grade scores rather than transforming them into a single predicted score. We allow the predicted scores or vectors of lagged scores to enter linearly (Column 1), linearly but fully interacted with race (Column 2), as a set of dummies indicating the decile in which the student falls for each measure (Column 3), and, finally, as a set of decile dummies fully interacted with the race of the student (Column 4). Columns 2 and 4 allow for the possibility that lagged achievement translates into current achievement differently for students of different races (or that teachers perceive this to be true), and Columns 3 and 4 allow current achievement to be a flexible nonlinear function of prior achievement. Across all of these specifications, the results are similar to our main result.

We also estimate specifications with class-level fixed effects (rather than school or teacher-by-school fixed effects). If results are simply driven by some interaction between pay-for-performance and non-random assignment of students to teachers, we should not observe a result within the classroom. If the results are driven (at least to some degree) by differential effort allocated towards particular students within a class, we should observe an expansion of the black-white gap even within classes. Results are reported in Appendix Table 3. As in Table 3, we find that
performance pay has a clear negative differential effect on black students, even when relying on within-classroom variation.

Overall, the pattern of results in Table 3 (and Appendix Tables 2 and 3) provides the stark message that teacher performance pay programs may fail in their objective of narrowing achievement gaps between minority and nonminority students. Notably, however, we do not find that performance pay leads to worse outcomes for black students. Instead, in most specifications, performance pay has a weak positive effect for white students ($\theta > 0$), a negative differential effect for black students ($\sigma < 0$), and – as a result – no detectable direct treatment effect on black students ($\theta + \sigma = 0$). While we probe the mechanisms driving this result in a later subsection, it is worth noting that this is consistent with the simple theoretical framework presented above. The theoretical framework does not suggest that teachers simply shift effort away from black students and towards white students (in which case we should expect a clear negative impact for black students). Instead, under performance pay, teachers increase effort for all students, but do so more for students they expect will generate higher marginal returns.

6.3 Robustness of the main result

In this subsection we present several tests aimed at assessing the robustness of our main result to concerns standard to the application of the difference-in-differences methodology. First, performance pay programs may be targeted at schools with declining student achievement if it was thought that performance pay could arrest the decline. Ignoring a declining trend in student EOC test scores in the treated schools would bias the estimated effect downwards. On the other hand, performance pay programs may be introduced into schools that are on an upward trajectory if the school improvement is caused by better administrators, and these administrators are also more likely and willing to incorporate new ideas. We address these concerns in several ways. First, we estimate a modified version of our model wherein we include school-specific linear time trends. The parameter $\theta$ is now identified by deviations from school trends in average student achievement. Results, reported in Appendix Table 4, are essentially unaffected by the inclusion of trends.

Second, also in the appendix, we report the results of a simple balance test. We estimate a series of school-level difference-in-differences style regressions, testing whether the introduction of treatment is associated with changes in a variety of relevant school-level observables: student race composition, teacher race composition, average poverty rate, average student parental education, and share of students eligible for free/reduced-price lunch. The balance tests (Appendix Table 5) reveal no clear shifts in these variables occurring at the same time as treatment.

Third, we modify our model in a more flexible manner. Specifically, we replace the “treated” indicator with indicators that are equal to one if a student is in a school that will be treated in 1-2 years (“1-2 years pre-treatment”), became treated in the current or previous year (“0-1 years post-treatment”), or became treated two or more years ago (“2+ years post-treatment”). Observations wherein students are in a school that will become treated in three or more years (“3+ years pre-treatment”) serve as the omitted category (one possible construction of these indicators is sketched below). Just as we fully interacted our main “treatment” indicator with “black”, we fully interact the vector of time to/since treatment dummies with “black” so that we can interpret the dynamic effects of performance pay on the black-white gap. Equally importantly for the present purposes, we can assess whether schools that will shortly introduce treatment have a noticeably different black-white gap in the years leading up to treatment. If this is the case, concerns around non-random selection of schools and pre-treatment trends would be warranted.

We present the results from our estimation of this more flexible model graphically in Figure 2. We plot the main effect of each “time to/since treatment” indicator (which indicates the estimated effect on white students), the interaction of the “time to/since treatment” indicator with black (which indicates the estimated effect on the black-white gap), and the linear combination of the two (which measures the estimated effect on black students). As in the main specification, we are most interested in the effect on the black-white achievement gap. Those coefficients are represented by the diagonally-shaded black and white bars in Figure 2. First, note that the estimated black-white gap is not statistically different in schools that will receive treatment in 1-2 years relative to schools that are further from receiving treatment and schools that never receive treatment. The same is true of the estimated effects on white students (white bar in Figure 2) and black students (dark gray bar in Figure 2) 1-2 years prior to treatment. There is no clear evidence of differential trends in treated schools immediately prior to treatment.[25]

[25] The figure also reveals that the effect on the black-white gap is relatively persistent. The effect of performance pay on the black-white gap is different from zero at a 0.01 level of significance both 0-1 years after treatment and 2 or more years after treatment begins.
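One possible construction of the time-to/since-treatment indicators described above, with hypothetical column names:

```python
# Sketch of the time-to/since-treatment indicators described above. Column
# names are hypothetical; `treat_year` is each school's first treated year
# (NaN for never-treated schools).
rel = df["year"] - df["treat_year"]

df["pre_1_2"] = rel.isin([-2, -1]).astype(int)    # 1-2 years pre-treatment
df["post_0_1"] = rel.isin([0, 1]).astype(int)     # 0-1 years post-treatment
df["post_2plus"] = (rel >= 2).astype(int)         # 2+ years post-treatment
# NaN comparisons evaluate to False, so never-treated schools and "3+ years
# pre-treatment" observations fall into the omitted category.

# Each indicator is then also fully interacted with `black`:
for col in ["pre_1_2", "post_0_1", "post_2plus"]:
    df[col + "_x_black"] = df[col] * df["black"]
```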

We further test the robustness of our results by comparing our estimate of the effect of performance pay on the black-white achievement gap to placebo estimates.26 Specifically, we repeat the following process 100 times. First, we randomly choose 34 schools to consider as "treated," excluding the 34 schools that were actually treated. Next, for each of those "placebo schools," we randomly draw a "placebo treatment start year" and code the school as treated from that year onward. We then estimate our main equation with placebo treatment in place of actual treatment. Figure 3 plots the distribution of the coefficients on "treated X black" (the effect on the black-white gap) from the 100 placebo estimations; the black bar in Figure 3 indicates our actual estimate of the treatment effect of performance pay on the black-white gap. As expected, the placebo estimates are approximately normally distributed, and the actual estimate is more negative than 92% of the placebo estimates.

26 See Jackson and Schneider (2015) for an example.
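A minimal sketch of this permutation exercise is below. The column names and the estimator callable are hypothetical stand-ins (the actual estimation is the paper's main difference-in-differences regression); we include it only to make the resampling logic concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def placebo_distribution(df, never_treated_schools, candidate_years,
                         estimate_sigma, n_iter=100, n_placebo=34):
    """Re-estimate the treated X black coefficient under randomly
    assigned placebo treatments.

    df: student-course observations with 'school' and 'year' columns
        (hypothetical layout); actually-treated schools are excluded
        from `never_treated_schools` by the caller.
    estimate_sigma: callable standing in for the main regression; takes
        the data and the treatment column name, returns the coefficient
        on treatment X black."""
    sigmas = []
    for _ in range(n_iter):
        schools = rng.choice(never_treated_schools, size=n_placebo,
                             replace=False)
        start_year = dict(zip(schools, rng.choice(candidate_years,
                                                  size=n_placebo)))
        # Code a school as "treated" from its placebo start year onward.
        df = df.assign(placebo=[
            int(s in start_year and y >= start_year[s])
            for s, y in zip(df['school'], df['year'])
        ])
        sigmas.append(estimate_sigma(df, treat_col='placebo'))
    return np.array(sigmas)
```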

We also report the results of a variety of other robustness checks in the appendix. First, we assess whether our results are driven by a particular school district. In Appendix Table 6, we repeat our main specification, dropping one district at a time: Cumberland, Guilford, and Charlotte-Mecklenburg schools are dropped in Columns 1, 2, and 3, respectively. Results are similar to the main results.27 Second, we report results from a sample restricted to years after 2002 to ensure that our estimates are not contaminated by policy changes arising from the implementation of No Child Left Behind in 2002. These results, which are similar to the main results, are reported in Appendix Table 7. Third, we estimate a much more flexible version of our main specification in which we fully interact race with all controls. Again, results (reported in Appendix Table 8) are similar.

Finally, the set of students taking the EOC test may change with treatment. Teachers may have an incentive to exclude students who are expected to underperform from their class (or at least arrange for their EOC test to be conducted in a way that does not count towards value-added). We test whether – for a given teacher-by-year observation – (1) the number of students taking the EOC test changes with treatment and (2) the share of EOC test takers who are black changes with treatment. In both cases, we detect no effect of treatment.28

27 We lose some precision when we drop Guilford County schools: the effect of performance pay on the black-white gap is negative and roughly similar in magnitude to the main estimate, but is different from zero only at a p-value of 0.11. This, however, should be expected; of the sample of students in schools that are ever treated, roughly 66% of observations are from Guilford County schools. Removing those observations removes a large fraction of the identifying variation.
28 Results are available upon request.
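The leave-one-out check in Appendix Table 6 has the same mechanical flavor; a two-line sketch, again with a hypothetical 'district' column and the same stand-in estimator callable as above:

```python
def leave_one_out_by_district(df, estimate_sigma):
    """Re-estimate the effect on the black-white gap, dropping one
    school district at a time (hypothetical 'district' column)."""
    return {d: estimate_sigma(df[df['district'] != d], treat_col='treated')
            for d in df['district'].unique()}
```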

6.4 Probing the mechanism

We have documented that performance pay has a differential effect on black students. There are a variety of potential explanations for this result. One explanation, put forward in the introduction and theoretical framework, is that teachers use race as a "shortcut" to form expectations about student ability (as in the statistical discrimination literature and Burgess & Greaves (2013)) and, in turn, focus their efforts on white students, whom they may perceive to have higher ability on average. On the other hand, it is possible that the effect is driven by characteristics correlated with race rather than race itself. Table 1 documents clear differences between black and white students with respect to prior achievement (predicted EOC score), neighborhood poverty rate, and parental education; each of these characteristics could potentially drive the patterns observed in Table 3.29

In Table 4, we interact the treatment indicator with these characteristics (prior achievement, neighborhood poverty rate, and parental education) and show that there is, in fact, heterogeneity in the treatment effect along some of these dimensions as well (especially parental education (Column 1) and predicted score (Column 3)). The question that remains is whether these factors explain the heterogeneity in treatment effects by race, or whether the race effect persists after accounting for heterogeneity in these other factors. While we control for prior achievement in all specifications in Table 3, and also control for parental education and neighborhood poverty rate in some specifications, we did not explicitly allow for heterogeneity in the treatment effect by these other characteristics. We do so in this section, providing a more stringent test of the hypothesis that performance pay has a differential effect specifically related to the race of the student rather than to characteristics correlated with race.

In Table 5, we report the results of four additional specifications in which we fully interact "black," "treated," and one other correlated characteristic in each regression (see the sketch below). The correlated characteristics we consider are: (1) dummy variables indicating parental education level (reported in Panel A), (2) dummy variables indicating whether the student's predicted score is above or below the sample median (in Panel B), (3) dummy variables indicating whether the student's predicted score is above or below their class median (in Panel C), and (4) dummy variables indicating whether the student's block group poverty rate is above or below the sample median (in Panel D).30

29 For instance, it may be that teachers target students they expect to be high ability based on their prior achievement (rather than race) once performance pay is introduced; given the correlation between race and prior achievement, this could appear as a differential effect for black students. Alternatively, it may be that teachers do not target at all, but instead that the treatment only "works" for students whose parents are highly educated (perhaps because of an increased focus on at-home preparation for tests, which may require engaged parents).
30 Results are similar if we simply split the sample by the characteristics listed here and test for a black-white achievement gap within the smaller selected samples.
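Concretely, for a binary characteristic G_i (say, an above-median predicted score), the fully interacted specification we have in mind can be sketched as below; the notation is ours, and all lower-order terms are included:

```latex
% Stylized fully interacted specification for one binary characteristic G_i.
Y_{isjt} = \theta_0\, P4P_{st} + \theta_1\,(P4P_{st} \times G_i)
         + \sigma_0\,(P4P_{st} \times Black_i)
         + \sigma_1\,(P4P_{st} \times Black_i \times G_i)
         + \beta_1\, Black_i + \beta_2\, G_i + \beta_3\,(Black_i \times G_i)
         + X_i'\gamma + \eta_s + \mu_{jt} + \varepsilon_{isjt}
```

For example, σ_0 is the effect on the black-white gap among students with G_i = 0, and σ_0 + σ_1 is the effect among students with G_i = 1; these subgroup-specific θ's and σ's are what Table 5 reports panel by panel.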


This allows us to test whether the differential black-white effect persists even within particular subsets of students (e.g., students with high predicted scores), or whether allowing for further heterogeneity in the treatment effect eliminates the black-white difference. In every specification in Table 5, performance pay has a significantly negative effect on the black-white achievement gap (σ < 0). Overall, the results in Table 5 indicate that the widening of the black-white achievement gap from performance pay programs occurs in every considered subgroup (high and low predicted score, high and low parental education, etc.) and is therefore not simply driven by other individual and family background characteristics. It appears to be the actual race of students that matters.

An alternative approach to answering the same question is to fully interact the treatment indicator with race and the other characteristics (poverty status, parental education, etc.) in a single specification and assess whether the heterogeneity in the treatment effect by race survives. We adopt this approach in specifications reported in Appendix Table 9; in that table, we continue to observe a differential negative effect of performance pay on black students.

Our results point to the possibility that teachers discriminate based on race in choosing where to exert additional effort when performance pay programs are introduced. Under the assumptions of the theoretical model discussed in Section 2, performance pay widens the achievement gap between students who teachers perceive as having more and less capacity to convert teacher effort into output. The widening of the black-white gap we find suggests that teachers, on average, believe that white students have more capacity to convert teacher effort into achievement gains than black students. We cannot provide definitive support for this claim – there are other possible explanations – but, in the remainder of this section, we probe the "teacher targeting" hypothesis by considering whether results are sensitive to the race composition of the class, the race of the teacher, and the size of the bonus.

The theoretical framework suggests that if the share of students perceived to have less capacity to convert teacher effort into output is sufficiently high, teachers will no longer consider it worthwhile to exert the additional effort required to obtain the performance bonus. If teachers' expectations are correlated with race, then any widening of the black-white gap should be more likely to occur in classrooms where black students are in the minority.


In those classrooms, teachers may perceive that there is a sufficient share of "high achieving" students to have a meaningful impact on average achievement. To test this prediction, we fully interact our treatment indicator, the black student indicator, and the interaction of these two indicators with a third variable equal to one if the majority of students in a class are black (and zero otherwise). Results are reported in Panel A of Table 6. Strikingly, when the class is majority black, we do not observe an effect on the black-white gap (-0.009 SD). This is consistent with the prediction that teachers do not aim for the performance bonus in these classes. As would then be expected, the second column shows that the widening of the black-white achievement gap comes from classes that are not majority black (-0.061 SD). This result is not sensitive to the choice to divide the sample into majority-black and non-majority-black classrooms: in Appendix Table 10, we report similar results for other cutoffs in the black share of the class (20%, 40%, 60%, or 80% black).

In a related test, Panel B reports results from specifications allowing the treatment effect (and the differential treatment effect for black students) to vary by the race of the teacher. This panel shows that the effect on the black-white achievement gap is evident for both black and white teachers. It is, however, smaller for black teachers, which may be due to black teachers not underestimating the ability of black students to the same extent that white teachers do (as documented by Gershenson et al. (2016)).31

Finally, as also suggested by our model, in order for teachers to find it worthwhile to increase effort in an attempt to receive a bonus (seemingly by targeting effort towards white students), the bonus must be sufficiently large to warrant the added effort. Empirically, we may therefore expect that teachers' increased effort (and the corresponding widening of the black-white gap) is more likely to occur when the bonus is large. Due to differences in program parameters across districts and – in some cases – over time within districts, we observe six different maximum possible bonus amounts. We call the highest three of these – all greater than $5,000 – "large"; bonuses of $5,000 or smaller are called "small". We then allow treatment (and the differential effect of treatment on black students) to vary by whether the bonus is large or small; a sketch of the variable construction follows below.
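As an illustration of the variable construction behind Tables 6 and 7, a pandas sketch is below. The column names and the bonus lookup table are hypothetical, not the authors' actual data layout.

```python
import pandas as pd

def add_heterogeneity_indicators(df, bonus_by_district_year):
    """Construct the class-composition and bonus-size indicators.

    df: one row per student-course observation with (hypothetical)
        columns 'class_id', 'black' (0/1), 'district', 'year'.
    bonus_by_district_year: columns 'district', 'year', 'max_bonus'
        (maximum possible bonus in dollars; missing if no program)."""
    # Share of black students in each class, computed over EOC test takers.
    share_black = df.groupby('class_id')['black'].transform('mean')
    df = df.assign(majority_black_class=(share_black > 0.5).astype(int))
    # Merge in the maximum possible bonus and flag "large" programs
    # (> $5,000); untreated school-years get max_bonus = NaN -> 0.
    df = df.merge(bonus_by_district_year, on=['district', 'year'], how='left')
    df = df.assign(large_bonus=(df['max_bonus'] > 5000).astype(int))
    return df
```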

31 Appendix Table 11 reports results from specifications in which we fully interact treatment, student race, teacher race, and a dummy indicating that a class is majority black. We do so because there may be correlations between teachers' race and the race composition of the classroom, which could provide an alternative explanation for the results in Table 6. Appendix Table 11 shows that the conclusions we have drawn are not dramatically altered by allowing for a full set of interactions.


Table 7 reports the results. The results are primarily driven by teachers eligible for a "large" bonus, which is consistent with our hypothesis.32

32 In Appendix Table 12, we see that the same basic conclusion can be drawn by allowing the size of the bonus to enter linearly (rather than as a binary variable).

7. Conclusion

In this paper, we ask: how does performance pay impact the black-white achievement gap? Prominent models (e.g., Lazear, 2000) and empirical work on performance pay outside of the context of education (Lazear, 2000; Paarsch and Shearer, 2000; Shearer, 2004; Bandiera et al., 2007) point towards the possibility that performance pay increases workers' effort and productivity. It may seem reasonable, then, to expect a similar increase in teacher effort (and, in turn, student achievement) when performance pay is introduced in schools. As a result, if performance pay is almost exclusively introduced in schools with a large share of minority students (as is the case through the Teacher Incentive Fund), the achievement gap could shrink.

In the types of settings studied in existing papers on performance pay (e.g., tree-planting, glass installation), it is relatively easy to measure, observe, and incentivize effort. For a variety of reasons, this may be less true of education, so the standard predictions may not apply.33 Most relevant for our paper is the fact that teachers must decide not just how much effort to exert, but also how to allocate effort across students. Because performance pay is based on average student achievement in our context, teachers might attempt to increase average achievement either by improving the scores of all students or by focusing efforts on some subset of the class and pushing for large gains among those students.34 Our theoretical model, accounting for this fact, predicts that teachers target their efforts towards the students they believe are "high ability," who – as suggested by recent research – may be white students. Even if performance pay is introduced to minimize achievement gaps and implemented exclusively in schools with a large share of minority students, it may cause teachers to target effort at white students, which, in turn, may widen the achievement gap in those schools.

We find that this is the case. Taking advantage of rich administrative data from North Carolina and using a difference-in-differences strategy, we find that performance pay has little impact on student achievement on average.


There is, however, a large difference in the effect of performance pay on black students relative to white students: black students experience significantly smaller gains than white students under performance pay (and, in some cases, actually suffer as a result of the reforms). While we cannot firmly conclude that strategic targeting of effort (paired with systematic underestimation of black students' ability) is the mechanism driving our results, some additional analyses provide suggestive support for this possibility. In particular, we find that the result occurs only in classrooms where a majority of students are white. This conforms to the targeting hypothesis: in order for targeting to be a worthwhile strategy, there must be a sufficient share of white students in the class to have a meaningful impact on the average gains that determine whether the teacher receives a bonus.

These results should not be taken as evidence that performance pay, generally speaking, cannot have a positive impact on reducing achievement gaps. Instead, our results, combined with others in the literature, carry implications for how performance pay programs should be designed to meet the objectives of policymakers. Lavy (2009) finds that performance pay which explicitly incentivizes increasing achievement among the lowest-performing students – rather than just incentivizing average achievement – leads to larger gains among those students and more attention from teachers. While these findings may at first seem at odds with ours, they in fact reinforce the central message of this paper: performance pay in schools affects not just the level of effort that teachers exert, but also the allocation of effort among students. When average achievement is incentivized (as is common in the US and in the programs we study), teachers may – and, in this paper, seem to – target students perceived to be high ability; when the achievement of marginal students is incentivized (as in Lavy (2009)), teachers instead target those students.

33 Neal (2011); Podgursky and Springer (2007).
34 An example of how this might occur is teachers providing more or better feedback to some students. There is some evidence that this happens: in an experiment, Taylor (1979) finds that student teachers provide briefer feedback in response to mistakes and less positive feedback in response to correct answers when they think the student is black.

REFERENCES
1. Atkinson, Adele, et al. "Evaluating the impact of performance-related pay for teachers in England." Labour Economics 16.3 (2009): 251-261.
2. Bandiera, Oriana, Iwan Barankay, and Imran Rasul. "Incentives for Managers and Inequality among Workers: Evidence from a Firm-Level Experiment." Quarterly Journal of Economics 122.2 (2007): 729-773.
3. Burgess, Simon, and Ellen Greaves. "Test scores, subjective assessment, and stereotyping of ethnic minorities." Journal of Labor Economics 31.3 (2013): 535-576.
4. Eberts, R., K. Hollenbeck, and J. Stone. "Teacher performance incentives and student outcomes." Journal of Human Resources (2002): 913-927.
5. Ferguson, R. F. (1998). Can schools narrow the black-white test score gap? The Black-White test score gap, 318-374.


6. Figlio, David N., and Lawrence W. Kenny. "Individual teacher incentives and student performance." Journal of Public Economics 91.5 (2007): 901-914.
7. Fryer, Roland G., and Steven Levitt. "The black-white test score gap through third grade." American Law and Economics Review 8.2 (2006): 249-281.
8. Fryer, Roland G. "Teacher Incentives and Student Achievement: Evidence from New York City Public Schools." Journal of Labor Economics 31.2 (2013): 373-407.
9. Gershenson, Seth, Stephen B. Holt, and Nicholas W. Papageorge. "Who believes in me? The effect of student–teacher demographic match on teacher expectations." Economics of Education Review 52 (2016): 209-224.
10. Glewwe, Paul, Nauman Ilias, and Michael Kremer. "Teacher Incentives." American Economic Journal: Applied Economics (2010): 205-227.
11. Goodman, Sarena F., and Lesley J. Turner. "The design of teacher incentive pay and educational outcomes: Evidence from the New York City bonus program." Journal of Labor Economics 31.2 (2013): 409-420.
12. Hedges, L. V., & Nowell, A. (1998). Black-White test score convergence since 1965. The Black-White test score gap, 149-181.
13. Hill, A. J., and Jones, D. (2016). Do teachers respond to individual-level performance pay programs? An evaluation of overall effects and gender differences. Working paper.
14. Holmstrom, Bengt, and Paul Milgrom. "Multitask principal-agent analyses: Incentive contracts, asset ownership, and job design." Journal of Law, Economics, & Organization 7 (1991): 24-52.
15. Jackson, C. Kirabo. "Teacher Quality at the High School Level: The Importance of Accounting for Tracks." Journal of Labor Economics 32.4 (2014): 645-684.
16. Jackson, C. Kirabo, and Henry S. Schneider. "Checklists and Worker Behavior: A Field Experiment." American Economic Journal: Applied Economics 7.4 (2015): 136-168.
17. Lavy, Victor. "Evaluating the effect of teachers' group performance incentives on pupil achievement." Journal of Political Economy 110.6 (2002): 1286-1317.
18. Lavy, Victor. "Performance Pay and Teachers' Effort, Productivity, and Grading Ethics." American Economic Review 99.5 (2009): 1979-2011.
19. Lazear, Edward P. "Performance Pay and Productivity." American Economic Review 90.5 (2000): 1346-1361.
20. Mansfield, Richard K. "Teacher quality and student inequality." Journal of Labor Economics 33.3 Part 1 (2015): 751-788.
21. Neal, Derek. "The Design of Performance Pay in Education." Handbook of the Economics of Education, Vol. 4 (2011): 495-550.
22. Neal, Derek, and Diane Whitmore Schanzenbach. "Left Behind By Design: Proficiency Counts and Test-Based Accountability." Review of Economics and Statistics 92.2 (2010): 263-283.
23. Paarsch, Harry J., and Bruce Shearer. "Piece rates, fixed wages, and incentive effects: Statistical evidence from payroll records." International Economic Review 41.1 (2000): 59-92.
24. Podgursky, Michael J., and Matthew G. Springer. "Teacher performance pay: A review." Journal of Policy Analysis and Management 26.4 (2007): 909-949.
25. Reback, Randall. "Teaching to the rating: School accountability and the distribution of student achievement." Journal of Public Economics 92.5 (2008): 1394-1415.
26. Rothstein, Jesse. "Teacher Quality in Educational Production: Tracking, Decay, and Student Achievement." Quarterly Journal of Economics 125.1 (2010): 175-214.
27. Shearer, Bruce. "Piece rates, fixed wages and incentives: Evidence from a field experiment." Review of Economic Studies 71.2 (2004): 513-534.


28. Sojourner, Aaron J., Elton Mykerezi, and Kristine L. West. "Teacher Pay Reform and Productivity: Panel Data Evidence from Adoptions of Q-Comp in Minnesota." Journal of Human Resources 49.4 (2014): 945-981.
29. Springer, M.G., Ballou, D., Hamilton, L., Le, V., Lockwood, J.R., McCaffrey, D., Pepper, M., and Stecher, B. (2010). Teacher Pay for Performance: Experimental Evidence from the Project on Incentives in Teaching. Nashville, TN: National Center on Performance Incentives at Vanderbilt University.


TABLES

Table 1. Descriptive Statistics

                                        North Carolina   Estimation sample (high need schools)
                                        (all students)   Black & white    White       Black
                                                         students         students    students
Demographics
  Black                                 26.1             60.4             --          --
  Female                                50.8             52.0             50.2        53.1
  Parent ed <= HS                       39.8             48.5             39.1        54.6
  Parent ed < 4-yr college              24.6             25.5             27.1        24.4
  Parent ed >= 4-yr college             35.6             26.1             33.7        21.0
  Census blk grp poverty rate           12.4             17.1             12.8        19.8
High school course-taking
  Algebra I                             26.3             31.6             35.4        29.2
  Algebra II                            15.5             13.5             13.7        13.5
  English I                             30.1             30.5             28.5        31.8
  Biology                               28.1             24.3             22.5        25.5
Achievement
  EOC score (standard deviation)        32.0 (9.6)       29.1 (9.3)       32.7 (9.4)  26.8 (8.4)
  Predicted EOC score                   31.4             29.0             32.8        26.4
  Pred. score >= est. sample median     --               50.6             70.9        37.3
  Pred. score >= class median           --               47.9             57.4        41.7
Student-course observations             4,401,388        603,295          238,605     364,690
Entries are percentages, except EOC scores (scale points, with standard deviations in parentheses) and observation counts.

Table 2. Unconditional and conditional black-white achievement gap in estimation sample

Dependent variable: student EOC test score (standard normalized)
                       (1)        (2)        (3)        (4)        (5)        (6)        (7)
Black                  -0.596***  -0.485***  -0.406***  -0.067***  -0.063***  -0.064***  -0.062***
                       (0.026)    (0.017)    (0.009)    (0.004)    (0.004)    (0.004)    (0.004)
Year X Subj FEs        X          X          X          X          X          X          X
School FEs                        X                     X                     X
Tch X School FEs                             X                     X                     X
Demos. & pred. ach.                                     X          X          X          X
Add'l controls                                                                X          X
Observations           596,805    596,805    596,805    596,805    596,805    494,955    494,955
R-squared              0.199      0.303      0.428      0.645      0.691      0.636      0.682
Regressions are at student-by-course level. "Demos. & pred. ach." controls include: gender, grade level, age, and predicted EOC score. "Additional controls" include controls for census block poverty rates and parent education, which are not available for the full sample. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1


Table 3. Effect of individual performance pay programs on black-white achievement gap

Dependent variable: student EOC test score (standard normalized)
                                  (1)       (2)        (3)        (4)       (5)        (6)
P4P in school: θ                  0.000     0.042*     0.045      -0.006    0.038***   0.033**
                                  (0.021)   (0.023)    (0.035)    (0.010)   (0.011)    (0.014)
P4P in school X black: σ                    -0.052***  -0.047***            -0.056***  -0.045***
                                            (0.013)    (0.015)              (0.010)    (0.012)
Post-estimation
Effect on black students: θ + σ             -0.010     -0.003               -0.018*    -0.011
[p-value]                                   0.629      0.926                0.0980     0.299
School FEs                        X         X          X
Tch X School FEs                                                  X         X          X
Year X Subject FEs                X         X          X          X         X          X
Additional controls                                    X                               X
Observations                      596,805   596,805    494,955    596,805   596,805    494,955
R-squared                         0.645     0.645      0.636      0.512     0.512      0.506
All regressions include controls for predicted EOC score, age, grade level, and gender. "Additional controls" include controls for census block poverty rates and parent education, which are not available for the full sample. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1

Table 4. Heterogeneity in treatment effect along other dimensions

Dependent variable: student EOC test score (standard normalized)
                                       (1)        (2)       (3)        (4)
P4P in school                          -0.033**   -0.026    -0.041**   -0.010
                                       (0.016)    (0.019)   (0.017)    (0.025)
P4P X Parent ed.: >HS, < 4-yr coll.    0.019
                                       (0.014)
P4P X Parent ed.: >= 4-yr coll.        0.053**
                                       (0.023)
P4P X Top Half of Class                           0.018
(above med. pred. score in class)                 (0.013)
P4P X Above Med. Pred. Score                                0.071***
                                                            (0.027)
P4P X High Poverty N'hood                                              -0.017
                                                                       (0.012)
Tch X School FEs                       X          X         X          X
Year X Subject FEs                     X          X         X          X
Additional controls                    X          X         X          X
Observations                           494,955    494,955   494,955    494,955
R-squared                              0.509      0.509     0.509      0.508
All regressions include controls for predicted EOC score, age, grade level, and gender. "Additional controls" include controls for census block poverty rates and parent education, which are not available for the full sample. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1


Table 5. Effect of P4P on black-white gap within subgroups

Dependent variable: student EOC test score (standard normalized)

Panel A: Treatment effect by parent education
                                         (1)              (2)              (3)
                                         Parent ed ≤      Parent ed <      Parent ed ≥
                                         high school      4-yr college     4-yr college
Performance pay program in school: θ     0.009            0.050***         0.062**
                                         (0.011)          (0.014)          (0.027)
P4P in school X black: σ                 -0.034***        -0.064***        -0.025*
                                         (0.008)          (0.014)          (0.014)
Effect on black students: θ + σ          -0.026*          -0.015           0.037**
[p-value]                                0.065            0.252            0.027

Panel B: Treatment effect by predicted score
                                         Pred. score <    Pred. score ≥
                                         est. sample med. est. sample med.
Performance pay program in school: θ     -0.012           0.054
                                         (0.018)          (0.033)
P4P in school X black: σ                 -0.035***        -0.031***
                                         (0.010)          (0.005)
Effect on black students: θ + σ          -0.047***        0.023
[p-value]                                0.004            0.474

Panel C: Treatment effect by class standing
                                         Pred. score <    Pred. score ≥
                                         class median     class median
Performance pay program in school: θ     0.014            0.020
                                         (0.020)          (0.030)
P4P in school X black: σ                 -0.056***        -0.028**
                                         (0.010)          (0.013)
Effect on black students: θ + σ          -0.042**         -0.008
[p-value]                                0.022            0.674

Panel D: Treatment effect by neighborhood poverty
                                         Census blk grp   Census blk grp
                                         pov rate > med.  pov rate ≤ med.
Performance pay program in school: θ     0.021            0.019
                                         (0.013)          (0.030)
P4P in school X black: σ                 -0.055***        -0.038***
                                         (0.011)          (0.012)
Effect on black students: θ + σ          -0.034**         -0.019
[p-value]                                0.022            0.359

Each panel corresponds to one regression where the estimates in each column are from interactions with the treatment indicators. All regressions include controls from the main specification: teacher-by-school fixed effects, year-by-subject fixed effects, predicted EOC score, age, grade level, gender, census block poverty rates, and parent education. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1


Table 6. Effect of P4P on black-white gap conditional on class composition and teacher race

Dependent variable: student EOC test score (standard normalized)

Panel A: Treatment effect by class composition
                                         (1)                 (2)
                                         Majority black      Not majority
                                         class               black class
Performance pay program in school: θ     -0.006              0.069***
                                         (0.016)             (0.013)
P4P in school X black: σ                 -0.009              -0.061***
                                         (0.016)             (0.005)
Effect on black students: θ + σ          -0.015              0.008
[p-value]                                0.182               0.540

Panel B: Treatment effect by teacher race
                                         Black teacher       Not black teacher
Performance pay program in school: θ     0.034*              0.021
                                         (0.017)             (0.015)
P4P in school X black: σ                 -0.032***           -0.053***
                                         (0.011)             (0.011)
Effect on black students: θ + σ          0.001               -0.032***
[p-value]                                0.920               0.002

Note: The effect of P4P on the black-white gap conditional on having a black teacher (-0.032) is significantly different from the effect on the gap conditional on having a non-black teacher (-0.053), with a p-value of 0.057. Each panel corresponds to one regression where the estimates in each column are from interactions with the treatment indicators. All regressions include controls from the main specification: teacher-by-school fixed effects, year-by-subject fixed effects, predicted EOC score, age, grade level, and gender. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1

Table 7. Heterogeneity in treatment effect by size of bonus

Dependent variable: student EOC test score (standard normalized)
                                          (1)        (2)        (3)
P4P X large maximum bonus (> $5,000)      0.054***   0.052***   0.055***
                                          (0.014)    (0.015)    (0.017)
P4P X small maximum bonus (<= $5,000)     -0.019     -0.016     -0.044
                                          (0.087)    (0.049)    (0.063)
P4P X large max. bonus X black            -0.058***  -0.060***  -0.050***
                                          (0.010)    (0.006)    (0.006)
P4P X small max. bonus X black            -0.007     -0.029     -0.001
                                          (0.023)    (0.033)    (0.042)
School FEs                                X
Tch X School FEs                                     X          X
Year X Subject FEs                        X          X          X
Additional controls                                             X
Observations                              596,805    596,805    494,955
R-squared                                 0.645      0.512      0.506
Note: There are six possible "maximum bonus" amounts (depending on school district and year): three are above $5,000; three are not. All regressions include controls for predicted EOC score, age, grade level, and gender. "Additional controls" include controls for census block poverty rates and parent education, which are not available for the full sample. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1


FIGURES

Figure 1. Theory model

[Figure: the student improvement functions for high- and low-type students, f_h(t) and f_l(t), plotted against teacher effort t, marking for each type the default effort t̃ and the higher bonus-seeking optimal effort t*, along with the corresponding achievement levels f(t̃) and f(t*).]

Figure 2. Dynamics of treatment effects

[Figure: bar chart of coefficients for "1-2 years pre-treatment," "0-1 years post-treatment," and "2+ years post-treatment" (relative to the omitted "3+ years pre-treatment" category), showing for each period the effect on white students, the effect on black students, and the black-white gap; the vertical axis runs from -0.15 to 0.15.]

The figure plots coefficients from a single regression, allowing for the possibility that students in "treated" schools are differentially impacted 1-2 years prior to treatment, 0-1 years post-treatment, and 2+ years post-treatment. White bars indicate the estimated effect on white students in the indicated time period, bars with diagonal black/white shading indicate the estimated differential effect on black students (the effect on the black-white gap), and black bars indicate the effect on black students (the linear combination of the coefficients represented by the white and shaded bars). Lines around coefficients indicate 95% confidence intervals.


Figure 3. Distribution of effects of randomly assigned placebo treatments on black-white achievement gap

As a histogram, we plot the distribution of the 100 estimated placebo effects on the black-white achievement gap (the "P4P X black" coefficients). In each iteration, we randomly choose 34 schools to code as "treated" and randomly choose a start year for their treatment, after which their treatment indicator is set to one. The actual estimated effect of treatment on the black-white gap (from Table 3) is indicated by the black line. Our actual estimate is lower than 92% of the estimated placebo effects.


APPENDIX I: Additional results

Appendix Table 1. Unconditional and conditional black-white achievement gap in untreated samples

Dependent variable: student EOC test score (standard normalized)
                       (1)        (2)        (3)        (4)        (5)        (6)        (7)
Panel A: Never treated schools only
Black                  -0.587***  -0.476***  -0.403***  -0.066***  -0.062***  -0.063***  -0.060***
                       (0.028)    (0.018)    (0.011)    (0.005)    (0.004)    (0.005)    (0.005)
Panel B: All schools, before any treatment (before 2007)
Black                  -0.622***  -0.491***  -0.385***  -0.052***  -0.046***  -0.052***  -0.047***
                       (0.035)    (0.022)    (0.009)    (0.004)    (0.004)    (0.005)    (0.005)
Year X Subj FEs        X          X          X          X          X          X          X
School FEs                        X                     X                     X
Tch X School FEs                             X                     X                     X
Demos. & pred. ach.                                     X          X          X          X
Add'l controls                                                                X          X
Regressions are at student-by-course level. "Demos. & pred. ach." controls include: gender, grade level, age, and predicted EOC score. "Additional controls" include controls for census block poverty rates and parent education, which are not available for the full sample. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1

Appendix Table 2. Effect of individual performance pay programs on black-white achievement gap (alternative controls for lagged scores)

Functional form of lagged score control:
                             (1)         (2)              (3)             (4)
                             Linear      Linear X black   Nonparametric   Nonparametric X black
Panel A: Predicted score from 8th grade EOGs (original model)
P4P in school: θ             0.033**     0.045***         0.036**         0.040***
                             (0.014)     (0.014)          (0.015)         (0.011)
P4P in school X black: σ     -0.045***   -0.059***        -0.048***       -0.052***
                             (0.012)     (0.012)          (0.012)         (0.009)
Panel B: Predicted score from 6th-8th grade EOGs
P4P in school: θ             0.028**     0.039***         0.030**         0.035***
                             (0.012)     (0.011)          (0.012)         (0.011)
P4P in school X black: σ     -0.040***   -0.055***        -0.039***       -0.045***
                             (0.012)     (0.012)          (0.011)         (0.009)
Panel C: Directly control for math and reading scores from 8th grade EOGs
P4P in school: θ             0.040***    0.054***         0.044***        0.047***
                             (0.014)     (0.013)          (0.014)         (0.011)
P4P in school X black: σ     -0.046***   -0.064***        -0.051***       -0.053***
                             (0.011)     (0.012)          (0.010)         (0.009)
Panel D: Directly control for math and reading scores from 6th-8th grade EOGs
P4P in school: θ             0.029**     0.037***         0.034***        0.034***
                             (0.012)     (0.012)          (0.012)         (0.012)
P4P in school X black: σ     -0.040***   -0.049***        -0.043***       -0.043***
                             (0.012)     (0.012)          (0.011)         (0.012)
Tch X School FEs             X           X                X               X
Year X Subject FEs           X           X                X               X
Additional controls          X           X                X               X
Panels A to D use alternative controls for lagged scores, and Columns 1 to 4 use alternative functional forms of the specified control(s). The nonparametric functional form allows the specified control(s) to vary by decile (decile fixed effects). All regressions include controls for age, grade level, and gender. "Additional controls" include controls for census block poverty rates and parent education, which are not available for the full sample. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1


Appendix Table 3. Effect of individual performance pay programs on black-white achievement gap (with class-level fixed effects)

Dependent variable: student EOC test score (standard normalized)
                             (1)         (2)
P4P in school X black: σ     -0.040***   -0.030***
                             (0.008)     (0.010)
Class-level FEs              X           X
Year X Subject FEs           X           X
Additional controls                      X
Observations                 596,805     494,955
R-squared                    0.737       0.733
Note: With class-level fixed effects, we cannot identify the main effect of treatment. All regressions include controls for predicted EOC score, age, grade level, and gender, as well as school-specific linear time trends. "Additional controls" include controls for census block poverty rates and parent education, which are not available for the full sample. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1

Appendix Table 4. Effect of individual performance pay programs on black-white achievement gap (with school-specific linear trends)

Dependent variable: student EOC test score (standard normalized)
                                  (1)       (2)        (3)        (4)       (5)        (6)
P4P in school: θ                  0.013     0.058      0.056      -0.004    0.043*     0.019
                                  (0.052)   (0.055)    (0.058)    (0.020)   (0.022)    (0.025)
P4P in school X black: σ                    -0.055***  -0.046***            -0.058***  -0.046***
                                            (0.009)    (0.013)              (0.010)    (0.012)
Post-estimation
Effect on black students: θ + σ             0.003      0.010                -0.015     -0.027
[p-value]                                   0.955      0.834                0.366      0.124
School FEs                        X         X          X
Tch X School FEs                                                  X         X          X
Year X Subject FEs                X         X          X          X         X          X
Additional controls                                    X                               X
Observations                      596,805   596,805    494,955    596,805   596,805    494,955
R-squared                         0.651     0.651      0.641      0.515     0.515      0.509
All regressions include controls for predicted EOC score, age, grade level, and gender, as well as school-specific linear time trends. "Additional controls" include controls for census block poverty rates and parent education, which are not available for the full sample. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1

Appendix Table 5. Treatment balance tests

                                          Coefficient on
Dependent variable (school-level)         "P4P in school"    Observations   R-squared
(1) Student race (share white)            -0.003 (0.013)     3,551          0.954
(2) Teacher race (share white)            -0.003 (0.029)     3,551          0.595
(3) Poverty rate                          0.004 (0.004)      3,551          0.897
(4) Parent education: <= HS               0.019 (0.033)      3,551          0.734
(5) Parent education: < 4-yr college      -0.002 (0.017)     3,551          0.602
(6) Parent education: >= 4-yr college     -0.017 (0.018)     3,495          0.763
(7) Share free or reduced price lunch     0.014 (0.022)      3,306          0.763
All regressions include year and school fixed effects. Regressions are at the school level and weighted by school size. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1


Appendix Table 6. Effect of P4P on black-white gap, dropping one district at a time

Dependent variable: student EOC test score (standard normalized)
                                            (1)          (2)         (3)
                                            Dropping     Dropping    Dropping
                                            Cumberland   Guilford    Charlotte-Meck.
                                            LEA          LEA         LEA
Performance pay program in school: θ        0.041***     0.027       0.045***
                                            (0.011)      (0.017)     (0.012)
P4P in school X black: σ                    -0.056***    -0.034      -0.064***
                                            (0.012)      (0.022)     (0.005)
Year X Subject FEs                          X            X           X
Tch X School FEs                            X            X           X
Observations                                544,005      519,443     518,330
R-squared                                   0.512        0.512       0.514
All regressions include controls for predicted EOC score, age, grade level, and gender. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1

Appendix Table 7. Effect of individual performance pay programs on black-white achievement gap (restricted to post-NCLB years)

Dependent variable: student EOC test score (standard normalized)
                                  (1)       (2)        (3)        (4)       (5)        (6)
P4P in school: θ                  0.009     0.045      0.034      -0.001    0.034***   0.023*
                                  (0.027)   (0.030)    (0.036)    (0.010)   (0.012)    (0.014)
P4P in school X black: σ                    -0.045***  -0.039***            -0.044***  -0.034***
                                            (0.011)    (0.014)              (0.010)    (0.012)
Post-estimation
Effect on black students: θ + σ             0.000547   -0.00463             -0.00979   -0.0107
[p-value]                                   0.983      0.880                0.333      0.310
School FEs                        X         X          X
Tch X School FEs                                                  X         X          X
Year X Subject FEs                X         X          X          X         X          X
Additional controls                                    X                               X
Observations                      479,443   479,443    417,121    479,443   479,443    417,121
R-squared                         0.638     0.638      0.628      0.510     0.510      0.504
All regressions include controls for predicted EOC score, age, grade level, and gender. "Additional controls" include controls for census block poverty rates and parent education, which are not available for the full sample. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1

Appendix Table 8. Effect of individual performance pay programs on black-white achievement gap (race fully interacted with all controls)

Dependent variable: student EOC test score (standard normalized)
                             (1)        (2)        (3)        (4)
P4P in school: θ             0.029      0.039      0.015      0.021
                             (0.023)    (0.035)    (0.013)    (0.016)
P4P in school X black: σ     -0.030**   -0.037**   -0.022**   -0.022**
                             (0.014)    (0.016)    (0.009)    (0.011)
School FEs                   X          X
Tch X School FEs                                   X          X
Year X Subject FEs           X          X          X          X
Additional controls                     X                     X
Observations                 596,805    494,955    596,805    494,955
R-squared                    0.648      0.639      0.517      0.510
All regressions include controls for predicted EOC score, age, grade level, and gender. "Additional controls" include controls for census block poverty rates and parent education, which are not available for the full sample. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1


Appendix Table 9. Effect of individual performance pay programs on black-white achievement gap (controlling for potential treatment effects along other dimensions)

Dependent variable: student EOC test score (standard normalized)
                                     (1)        (2)        (3)
P4P in school: θ                     -0.011     0.010      -0.008
                                     (0.014)    (0.012)    (0.014)
P4P in school X black: σ             -0.024***  -0.037***  -0.025***
                                     (0.005)    (0.007)    (0.005)
P4P X Parent ed: >HS, <4-yr coll.    0.013      0.017      0.014
                                     (0.010)    (0.012)    (0.011)
P4P X Parent ed: >=4-yr coll.        0.035**    0.047**    0.035**
                                     (0.017)    (0.021)    (0.017)
P4P X Above Med. Pred. Score         0.062***              0.069***
                                     (0.022)               (0.022)
P4P X High Poverty N'hood            -0.004     -0.006     -0.004
                                     (0.006)    (0.007)    (0.006)
P4P X Top Half of Class                         0.009      -0.013
                                                (0.011)    (0.008)
Tch X School FEs                     X          X          X
Year X Subject FEs                   X          X          X
Additional controls                  X          X          X
Observations                         494,955    494,955    494,955
R-squared                            0.507      0.506      0.507
All regressions include controls for predicted EOC score, age, grade level, and gender. "Additional controls" include controls for census block poverty rates and parent education, which are not available for the full sample. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1

Appendix Table 10. Effect of P4P on black-white gap conditional on class composition (alternative cutoffs)

Dependent variable: student EOC test score (standard normalized)

Panel A: Treatment effect by class composition
                             Class black share > 0.2   Class black share ≤ 0.2
P4P in school: θ             0.032**                   0.068***
                             (0.014)                   (0.017)
P4P in school X black: σ     -0.043***                 -0.075***
                             (0.012)                   (0.013)
Panel B: Treatment effect by class composition
                             Class black share > 0.4   Class black share ≤ 0.4
P4P in school: θ             0.017                     0.075***
                             (0.015)                   (0.013)
P4P in school X black: σ     -0.029**                  -0.052***
                             (0.013)                   (0.009)
Panel C: Treatment effect by class composition
                             Class black share > 0.6   Class black share ≤ 0.6
P4P in school: θ             0.002                     0.049***
                             (0.019)                   (0.012)
P4P in school X black: σ     -0.017                    -0.053***
                             (0.018)                   (0.008)
Panel D: Treatment effect by class composition
                             Class black share > 0.8   Class black share ≤ 0.8
P4P in school: θ             -0.035                    0.038***
                             (0.026)                   (0.013)
P4P in school X black: σ     0.017                     -0.047***
                             (0.034)                   (0.011)
Each panel corresponds to one regression where the estimates in each column are from interactions with the treatment indicators. All regressions include controls from the main specification: teacher-by-school fixed effects, year-by-subject fixed effects, predicted EOC score, age, grade level, and gender. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1


Appendix Table 11. Effects of P4P on black-white gap conditional on teacher and class composition, allowing for treatment effects stemming from rich teacher and class characteristics

Dependent variable: student EOC test score (standard normalized)
                                         (1)         (2)
P4P in school                            0.026
                                         (0.028)
P4P X black                              -0.051***   -0.052***
                                         (0.004)     (0.004)
P4P X black X black teacher              -0.009      -0.002
                                         (0.013)     (0.008)
P4P X black X maj. black class           0.041***    0.041***
                                         (0.016)     (0.013)
Year X Subj FEs                          X           X
Tch X School FEs                         X
Class-level FEs                                      X
Additional controls, all fully interacted with the treatment indicator: additional teacher characteristics (years of experience, an indicator for holding a graduate degree, and scores on teaching licensing exams) and an indicator for Honors courses.
Observations                             482,068     482,068
R-squared                                0.523       0.738
All regressions include controls for predicted EOC score, age, grade level, and gender. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1

Appendix Table 12. Heterogeneity in treatment effect by size of bonus (with continuous measure of size of bonus)

Dependent variable: student EOC test score (standard normalized)
                                                 (1)        (2)        (3)
P4P maximum bonus (in thousands, no P4P = $0)    0.005**    0.001      0.001
                                                 (0.002)    (0.001)    (0.001)
P4P maximum bonus X black                        -0.003***  -0.003***  -0.002**
                                                 (0.001)    (0.001)    (0.001)
School FEs                                       X
Tch X School FEs                                            X          X
Year X Subject FEs                               X          X          X
Additional controls                                                    X
Observations                                     596,805    596,805    494,955
R-squared                                        0.645      0.512      0.506
All regressions include controls for predicted EOC score, age, grade level, and gender. "Additional controls" include controls for census block poverty rates and parent education, which are not available for the full sample. Robust standard errors clustered at the school district level in parentheses. *** p<0.01, ** p<0.05, * p<0.1


APPENDIX II: Extending the simple theoretical model

The extra effort allocated to high-ability types is driven by the desire to raise the average level of achievement in the least costly way. This should only be possible if there are enough high-ability students in the class to make it feasible. We therefore now evaluate how optimal teacher effort is affected by class composition. Again assuming that the performance pay parameters incentivize teachers to exert more effort and obtain the bonus, we know that teachers exert exactly the amount of effort required to obtain the bonus:

    Σ_i f_i(t_i*) = X.

With two types of students – high types h and low types l, with p_l the share of low types in a class of size n – this becomes

    p_l n f_l(t_l*) + (1 − p_l) n f_h(t_h*) = X.

Recognizing that t_l* and t_h* are functions of the share of low-type students p_l, we can totally differentiate this constraint with respect to p_l. Because the teacher's cost-minimizing allocation equates marginal products across students, f_l'(t_l*) = f_h'(t_h*) ≡ f'(t*), the result can be written as

    [p_l t_l*'(p_l) + (1 − p_l) t_h*'(p_l)] f'(t*) = f_h(t_h*) − f_l(t_l*).

Given the positive achievement gap f_h(t_h*) − f_l(t_l*) > 0 and f'(t*) > 0, it must be the case that p_l t_l*'(p_l) + (1 − p_l) t_h*'(p_l) > 0 for all p_l; and since the equal-marginal-product condition forces the two effort levels to move together, t_l*'(p_l) > 0 and t_h*'(p_l) > 0. As the share of low-type students in the class increases, the teacher increases her effort to both types of students.

However, changing the share of low-type students in the class also affects the teacher's decision to pursue the bonus in the first place. For the teacher to pursue the bonus, the net bonus must exceed the cost of the additional effort required to obtain it:

    B ≥ c([p_l t_l* + (1 − p_l) t_h*] n) − c(n t̃).

Denoting the average effort per student ē = p_l t_l* + (1 − p_l) t_h* and given n > 0, costs increase with the share of low-type students if the average effort per student increases with that share. It can be shown that

    ∂ē/∂p_l > 0  ⟺  [f_h(t_h*) − f_l(t_l*)] / (t_h* − t_l*) > f'(t*),

and the right-hand inequality necessarily holds given the assumptions on the shapes of the student improvement functions and the assumption that t_h ≥ t_l. Teacher effort costs therefore increase with the share of low-type students in the class. This means that if the share of low-type students in the class increases, the teacher may reach the point where it becomes optimal to return to the default effort t̃ and no longer pursue the bonus.


APPENDIX III: Student-teacher matching algorithm

We use counts of students in a set of grade and gender-by-race bins to measure classroom demographic composition. Grade is 9th, 10th, 11th, or 12th; gender is male or female; and race is white, black, or Hispanic, resulting in four grade bins and six gender-by-race bins. With the addition of total student count (reported separately from the other characteristics in the classroom-level file), there are a total of 11 dimensions used to describe classroom composition.

Matches on classroom demographic composition from the two sources are not expected to be perfect for a variety of reasons. First, the reported and constructed measures of classroom demographic composition are from different points in time; the classroom-level information is from the beginning of a course, while the student-level EOC file reflects composition at the end of a course (when students take the EOC test). Students may change classrooms or schools during this period. Second, some students may take the course but not write the EOC test (if they are absent on test day, for example). And, third, it is possible that classroom demographics from both sources are simply measured with error. As a result, we use the following algorithm to obtain course-specific student-teacher matches; matched classes are set aside after each step, and a sketch of the final fuzzy-matching step follows the list.

1. In schools with only one teacher of the relevant course (in a given year), students and the teacher are matched. For example, in a school with only one teacher of Algebra I, all students writing the Algebra I EOC test are matched to this teacher. (13 percent of matched students are linked to their EOC teachers in this step.)

2. When reported classroom demographic composition (from the classroom-level file) perfectly matches constructed composition (from the student-level file) in all 11 categories, students from the student-level file are matched to teachers from the classroom-level file. (31 percent)

3. "Total student count" is excluded from the measure of classroom demographic composition (for this step and future steps). This is because it is the sum of students in either the grade or race-gender bins, so it would exaggerate errors if the counts in any of these bins are incorrect. When reported composition perfectly matches constructed composition in the remaining 10 categories, students and teachers are matched. (<1 percent)

4. When reported composition perfectly matches constructed composition in 9 out of the 10 categories, students and teachers are matched if this match is unique and the deviation in the unmatched category is less than 2. In other words, students and teachers are matched if there is only a small mismatch on only one dimension of classroom demographic composition. If one classroom in the student-level file matches multiple classrooms in the classroom-level file, students and teachers are matched to the classroom for which the deviation in the unmatched category is smallest (provided it is less than 2). If there are multiple matches with the same smallest deviation in the unmatched category, the classroom of students from the student-level file is dropped. (5 percent)

5. Repeat the above step, but link students and teachers with perfect matches on 8 out of 10 categories, and keep matches if the sum of deviations in the unmatched categories is less than 4. (26 percent)

6. The final steps in the algorithm use a fuzzy algorithm based on an overall distance measure: the sum of the absolute values of deviations in the 10 categories. Beginning with the constructed composition from the student-level file, find the best match in the classroom-level file, dropping classrooms from the student-level file with multiple best matches. Given that a classroom from the classroom-level file may be matched to multiple classrooms in the student-level file, for every classroom in the classroom-level file, only keep the match with the smallest distance measure to ensure mutual best matching. Repeat this step after setting aside the matches from the first iteration of the fuzzy algorithm. (25 percent)

