Journal of Urban Economics 100 (2017) 54–64


Can testing improve student learning? An evaluation of the Mathematics Diagnostic Testing Project

Julian R. Betts (a), Youjin Hahn (b,*), Andrew C. Zau (a)

(a) Department of Economics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
(b) School of Economics, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 120-749, South Korea
(*) Corresponding author. E-mail addresses: [email protected] (J.R. Betts), [email protected] (Y. Hahn), [email protected] (A.C. Zau).

Article info

Article history: Received 11 March 2015; Revised 17 April 2017; Available online 25 April 2017.
JEL classification: I21.
Keywords: Diagnostic testing; Mathematics; Education; Feedback; Accountability.

Abstract

Unlike state-mandated achievement tests, tests from the Mathematics Diagnostic Testing Project (MDTP) offer teachers timely and detailed feedback on their students' achievement. We identify the effects of providing this feedback on student outcomes using data from the San Diego Unified School District, a large urban school district where diagnostic tests in mathematics were mandated in some grades between 6 and 9 between 1999 and 2006. We find that providing diagnostic feedback improves math test scores by roughly 0.1 standard deviations. We do not find significant differences in the effect based on students' prior math achievement. © 2017 Elsevier Inc. All rights reserved.

1. Introduction

Partly in response to the federal No Child Left Behind law of 2001, U.S. states have adopted statewide mandatory testing of public school students meant to measure academic achievement and progress towards universal student proficiency. These tests, often referred to as summative tests, are designed to measure overall student achievement rather than to diagnose specific areas in which students need to improve. The existing state tests lack this level of detail, and often they take so long to grade that the school year has ended before schools receive results. Frustrated in part with the limited ability of the existing state tests to diagnose specific learning difficulties in a timely fashion, many states have recently signed on to the "Common Core", a new curriculum that aims to impart higher-order skills to students. Accompanying the new standards are two multi-state consortia that unveiled new tests in 2015 that aim to provide schools with much more timely and detailed reports on student progress. Both the Smarter Balanced Assessment Consortium (SBAC) and the Partnership for Assessment of Readiness for College and Careers (PARCC) are designing computer-adaptive tests for use in multiple states, which will provide speedy feedback on student performance.




In addition to providing diagnostic information on what materials students have and have not mastered, the two consortia are developing optional interim assessments that districts can use to gauge student progress during the year.1 These reforms reflect potentially monumental changes to the way the nation tests, with a total of 11 states administering PARCC and 17 states administering the SBAC tests in 2015–16.

A major question underlying these new developments is whether diagnostic testing that provides detailed diagnosis of students' strengths and weaknesses, with rapid turnaround time, can make a tangible contribution to student learning. Very little is known about the impact of such tests.2 We therefore study a form of diagnostic testing already widely used in California, and test whether its adoption at the district level leads to better academic achievement.

California, like other states, has implemented an educational accountability system designed to measure the performance of students in public schools, and to provide a series of interventions for schools that fail to improve sufficiently.

1 Information on the two consortia can be found at www.parconline.org and www.smarterbalanced.org. 2 A literature on formative assessment, defined as classroom assessments conducted by teachers on materials recently taught, suggests that such assessments can boost student performance. See Black and Wiliam (1998) and Collins et al. (2011) for a discussion of formative assessment. Note however, that in the present study we focus on more formal tests that originate outside the teacher’s own classroom.


During the period we study, several mandated tests fed into the calculation of the summary Academic Performance Index (API), the single number the state published annually as a proxy for each school's academic performance. These tests included the California Standards Test (CST), offered in a variety of subjects between grades 2 and 11, and the California High School Exit Exam, which high school students had to pass in order to receive a diploma. This accountability program has informed the public about average achievement levels in California and the large variations in achievement across schools and across demographic groups. However, the testing system does not provide timely feedback to teachers about their students' performance, or about the specific subareas of knowledge within a subject in which individual students need to improve. Notably, results from CST tests given in March did not become available until late summer. In its first two years of SBAC testing, in 2014–15 and 2015–16, California delivered test results to schools sooner, but as of the 2015–16 administration the tests offered little in the way of diagnosis of specific areas of student strength and weakness. Greater detail is planned in the future, but students and teachers currently receive only an overall score and performance information on four claims, which are broad areas such as "concepts and procedures" and "communicating reasoning".

This paper examines the effectiveness of providing diagnostic feedback to teachers and students that is much more detailed than the state test results described above. The aforementioned state tests are designed mainly for purposes of external accountability rather than to improve learning. There is a unique diagnostic test, with a quite different purpose, that has been freely offered to math teachers and their students throughout California since 1982. The Math Diagnostic Testing Project (MDTP) is a joint program of the California State University and the University of California, and it offers free diagnostic testing to math teachers throughout California. The MDTP is designed to allow teachers to obtain diagnostic testing of their students, with speedy and detailed feedback. Teachers typically receive printed results within days of test administration, along with overall and student-by-student information on performance on individual topics within the subject of the test. For example, the MDTP's Algebra Readiness test, designed to be given to students before their first algebra course, gives information to teachers and their students on students' understanding of integers, fractions, decimals, percentages and six other clearly defined topics.

We identify the effects of providing feedback on student outcomes by using data from the San Diego Unified School District (SDUSD), where mandatory diagnostic tests in mathematics were implemented in some grades between 6 and 9 between 1999 and 2006. Using variation arising from the different timing and targeted grades of the MDTP mandates, we find that students who in the previous year took an MDTP test under the mandatory district-wide program gained 0.09 standard deviations in math performance compared to those who did not take the MDTP test. The effect is slightly smaller than the finding by Dee and Jacob (2011), who estimate an effect size of 0.14 in grade 8 math for the implementation of NCLB, including testing plus accountability provisions.
In a famous experiment, Finn and Achilles (1990) estimate that reducing class size by roughly one third in Tennessee elementary schools improved math and reading achievement by 0.27 and 0.23 of a standard deviation, respectively. Note, however, the significant costs of reducing class size compared to the much smaller costs of a full-blown accountability system, and the yet smaller cost of the MDTP's 45-minute test, offered free throughout California and online outside California for a nominal fee.

Quite a few studies have examined broader accountability systems such as No Child Left Behind (see Dee and Jacob, 2011 for an evaluation of its impact and a helpful review of earlier work on accountability systems, and Hamilton et al., 2007 and Craig et al., 2013 for an analysis of how teachers and administrators have responded).


However, there is sparse rigorous evidence on the effectiveness of low-stakes diagnostic testing on educational outcomes. One notable exception in the recent literature is Muralidharan and Sundararaman (2010), which evaluates the effect of providing diagnostic feedback to teachers on student learning outcomes in India. That paper finds that students in feedback schools, where teachers received detailed test-score results, did not perform differently from students in control schools, even though teachers in feedback schools exerted more effort when observed in the classroom. Unlike the experimental setting in India, the mandatory MDTP test used in our study was implemented in a time and place where schools were subject to accountability systems, which would give teachers a stronger incentive to use the diagnostic feedback to improve student learning. Our findings suggest that providing diagnostic feedback could be effective in a setting where schools and teachers are accountable for student outcomes, whereas, as in Muralidharan and Sundararaman (2010), it may be less effective under a weak accountability system, where its impact depends on teachers' intrinsic motivation to translate feedback into student learning.

Our paper also relates to the large literature on peer effects, as one of the mechanisms through which diagnostic testing could influence achievement is ability grouping or, in higher grades, tracking into different tiers of coursework in the same subject. Studies of peer effects, such as Hanushek et al. (2003), Duflo et al. (2011) and Lefgren (2004), typically find positive peer effects on student achievement. De Fraja and Martinez-Mora (2014) provide a theoretical analysis of the tradeoff between housing segregation and tracking within schools. For recent reviews of the literatures on peer effects and tracking, see Sacerdote (2011) and Betts (2011), respectively.

2. Background

The MDTP consists of a set of "readiness" tests designed to give students and teachers detailed feedback on the student's readiness to move on to a given course in the next academic year. During the time period of this study, the most commonly used MDTP tests in the SDUSD were the Pre-algebra Readiness Test, the Algebra Readiness Test and the Geometry Readiness Test. For example, the Algebra Readiness Test assesses a student's preparation in topics that are required knowledge for a student to fare well in a subsequent algebra course, while the Geometry Readiness Test evaluates a student's understanding of first-year algebra topics that students will need to have mastered in order to do well in a subsequent geometry course. Each test lasts 45 minutes and contains, depending on the subject matter, 40–50 multiple-choice questions. The teacher receives, for each student as well as for his or her class, indicators of the percentage of questions answered correctly in specific areas.3 Areas in which a student answers fewer than 70% of questions correctly are flagged for teacher follow-up. The teacher receives this information individually for each student, but also receives statistics on the percentage of students scoring 70% or higher on each math topic, which might help him or her focus on areas needing greater emphasis and clarity in the classroom in general.
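To make the structure of this feedback concrete, the short sketch below computes per-topic percent-correct scores for each student and applies the 70% flag, plus the class-level share of students at or above 70% on each topic. The column names and the tiny data set are hypothetical placeholders, not the MDTP's actual file format.

import pandas as pd

# Hypothetical item-level results: one row per student, question, and topic.
items = pd.DataFrame({
    "student": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "topic":   ["Fractions", "Fractions", "Integers", "Integers"] * 2,
    "correct": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Percent correct by student and topic (the student-level diagnostic).
by_student = (items.groupby(["student", "topic"])["correct"]
                   .mean().mul(100).rename("pct_correct").reset_index())
by_student["flagged"] = by_student["pct_correct"] < 70  # follow-up threshold

# Share of the class scoring 70% or higher on each topic (the class-level report).
by_class = (by_student.assign(at_or_above_70=~by_student["flagged"])
                      .groupby("topic")["at_or_above_70"].mean().mul(100))

print(by_student)
print(by_class)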

3 For example, in the Algebra Readiness test, the areas include: Data Analysis, Probability and Statistics; Decimals, their Operations and Applications; Percent; Simple Equations and Operations with Literal Symbols; Exponents and Square Roots; Scientific Notation; Fractions and Their Applications; Measurement of Geometric Objects; Graphical Representation; Integers, Their Operations and Applications.


Table 1
SDUSD mandated use of MDTP tests by school year, test and grade levels.

Year   MDTP readiness test   Grade levels
1999   Geometry              8 and 9
2000   Geometry              8 and 9 in the Spring, and Grade 8 students enrolled in Algebra during summer school
2001   Geometry              8 and 9
2002   Geometry              8 and 9
2003   Algebra               7
2004   Pre-Algebra           6 at selected schools
2004   Algebra               7
2005   Pre-Algebra           6
2005   Algebra               7
2006   Pre-Algebra           6
2006   Algebra               7

Note: Year refers to the start of the school year, e.g. 2006 refers to the 2006–2007 school year. This information was kindly provided by Bruce Arnold, the Coordinator of MDTP for the San Diego area.

The intent of these tests is thus quite different from that of statewide testing systems as mandated under the federal No Child Left Behind law. The latter are designed to measure overall school progress, and results can trigger various interventions. The MDTP tests are designed to help students and teachers work together on specific math topics that the student has yet to master. Nonetheless, the mandated MDTP test was not entirely a low-stakes diagnostic test. The test was given in May and June, and was one of many factors that teachers took into account when placing students into math classes for the subsequent year.

One key distinguishing feature of the MDTP is that, unlike the state tests that comprise California's official accountability system, the MDTP tests are graded locally in one of ten regional site offices. The proximity of the grading centers confers two advantages, by speeding up turnaround time dramatically, and by enabling regional coordinators to work with local schools to interpret and use test results. The accessibility of the local MDTP offices is largely responsible for the ability of the MDTP to provide timely and detailed feedback to teachers, which they can immediately put to use by tailoring their instruction to the specific needs of individual students.4 But perhaps the most important distinction from the state test in use at the time of our study is that the MDTP gives teachers and students very specific feedback on areas within a given curriculum where students need to improve.

In 2007–2008, MDTP scored approximately 40,000 tests for 95 SDUSD schools (including private schools in the SDUSD region). In general, most of the testing occurs in middle schools, but a significant number of high school students are also tested. For instance, of the 40,000 tests scored in 2007–2008, 10,500 were scored at 22 schools for which the highest grade was grade 12.5

Beginning in 1999–2000, the district as a whole mandated the use of at least one MDTP test at the end of the school year. These mandates were phased out during our sample period in higher grades and implemented for the first time about halfway through our sample period in lower grades. In addition, as students move from grade to grade, they are subject to testing in only a few grades. Thus, the effects of MDTP are identified by comparing individual student trends in achievement during years and grades with and without testing. Table 1 shows the way in which the SDUSD mandated testing. Table 2 shows the actual proportions of students in each grade and school year who took an MDTP diagnostic test.

4 More details on the MDTP tests and the history of the program are available at http://mdtp.ucsd.edu. 5 This information was provided by Alfred Manaster, who is the emeritus State Director of the MDTP.

Table 2
The proportion of students who took MDTP for a given grade and year.

Year         Grade 6   Grade 7   Grade 8   Grade 9   Grade 10
2001–2002    0         0.03      0.81      0.51      0.04
2002–2003    0         0.03      0.60      0.43      0.03
2003–2004    0         0.88      0.02      0         0
2004–2005    0.23*     0.91      0         0         0
2005–2006    0.95      0.88      0         0         0
2006–2007    0.63      0.63      0.01      0         0

Note: Calculated using the San Diego Unified School District data. Mandated grade-years are those listed in Table 1. * In school year 2004–2005, the MDTP was mandated for 6th grade students at selected schools only.

Table 2 suggests fairly high, but not complete, compliance with the district mandate. The estimated effect of MDTP could be biased if there is non-random compliance. As discussed in Section 5.2, we find that the effects of actual MDTP taking and "intent-to-take" are similar, suggesting that non-random compliance is unlikely to bias our estimates.

In addition to the mandated district-wide MDTP testing, individual teachers in the district could voluntarily opt to use the MDTP tests in any of their classrooms. However, due to the low take-up rates of voluntary testing and the potentially endogenous choice made by teachers who voluntarily adopt MDTP, we focus on the natural experiment in which the district phased in mandated MDTP testing. The results in the paper on the relationship between MDTP testing and subsequent performance are not sensitive to whether we add a control for voluntary testing.

3. Data

The SDUSD enrolls approximately 135,000 students in preschool through grade 12. It is the second largest district in California and as of 2015 was the 20th largest urban district in the United States. Our longitudinal data consist of complete student academic records, including test scores, academic grades, courses taken and absences, from fall 2001 through spring 2007. The data include indicators of MDTP tests taken in a given year, and a rich set of covariates related to the student's class size and teacher qualifications. Teacher qualifications at the SDUSD fall under three main areas: education, credentialing, and years of experience. For most teachers, our data include information on degrees attained (bachelor's, master's, or doctorate) and subject major. We control for the level of overall credential and the math subject authorization of the student's homeroom teacher (in elementary grades) and the student's math teacher in middle and high school.6 Appendix Table A1 shows summary statistics.

California has administered various standardized tests at different times. It mandated the Stanford 9 test from spring 1998 through spring 2002, and the California Standards Test (CST) in spring 2002 and later years. Our outcome of interest is a standardized math CST score. We use CST data from spring 2002 through 2007. We convert the CST score into Z-scores. Up until grade 7, one type of CST is offered in each grade.

6 Credentials can be divided into two types: overall credential, and subject authorization. The credential simply indicates that the teacher has completed requirements thought to help a person operate a classroom. A math subject authorization, in contrast, indicates that a teacher has taken a requisite set of college math courses. For elementary teachers, subject authorization is not necessary as a multiple subject credential is sufficient to teach within a self-contained classroom. Secondary school teachers usually obtain an authorization to teach specific subjects, which requires that teachers complete prescribed college courses in the given subject. Authorizations fall under four levels. In declining order, these are full, supplementary, board resolution, and limited assignment emergency. Further details of the credentialing system may be found in Chapter 3 of Betts et al. (2003).

Table 3
Seven popular test-type sequences.

Grade 8   Grade 9   Grade 10   Grade 11   Freq.   Percentage   Cum.
AL1       Gmtry     AL2        HsMath     1269    30.71        30.71
AL1       AL1       Gmtry      AL2        956     23.14        53.85
AL1       AL1       Gmtry      I. Math    397     9.61         63.46
AL1       AL1       Gmtry      Gmtry      250     6.05         69.51
AL1       Gmtry     AL2        AL2        177     4.28         73.79
Gmtry     AL2       HsMath     HsMath     153     3.70         77.49
AL1       AL1       AL1        Gmtry      110     2.66         80.15

Note: We follow the cohort that was in 8th grade in the 2001–2002 academic year through 11th grade in 2004–2005. Subject names are: AL1: Algebra 1; AL2: Algebra 2; Gmtry: Geometry; HsMath: High school math; I. Math: Integrated Math. The test for high school math incorporates questions taken from the Geometry, Algebra 1 and Algebra 2 tests. Integrated math tests incorporate questions taken from other end-of-course mathematics tests, such as Geometry, Algebra 1, Algebra 2, and Probability and Statistics (source: http://www.cde.ca.gov/ta/tg/sr/blueprints.asp).

For students in grade 7 and below, we standardize the CST score by subtracting the district-wide mean and dividing by the district-wide standard deviation, for a given grade. Starting in grade 8, students take different versions of the math CST depending on the subject matter they study. Thus, for students in grade 8 and above, we standardize the CST score by grade and type of test. We present in Table 3 the seven most common sequences of test-taking for the cohort that was in 8th grade in 2001–2002. Up until 8th grade, students take similar courses. However, students' course-taking behavior starts to diverge in 9th grade. For instance, some students in 9th grade repeat Algebra 1, while others take Geometry or Algebra 2. This diversity in course-taking behavior implies that standardizing the test score by type of test and grade is required for upper-grade students. Given that in grade 9 and later students diverge substantially in the type of math course taken, and therefore in the type of math CST test taken, in the main text we focus on the introduction of mandated testing in grades 6 and 7, while modeling achievement growth in grades 3 through 8. When we extend the analysis through grade 11, which allows us to study the impact of mandated testing in grades 8 and 9 as well, results are similar, with slightly larger effects. The Appendix shows some of these latter results.
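A minimal sketch of this standardization is given below, assuming a student-level data frame with hypothetical columns grade, test_type and cst_raw; whether the z-scoring is also done separately by year is a choice not spelled out in the text and is noted in the comments.

import pandas as pd

def zscore(s: pd.Series) -> pd.Series:
    return (s - s.mean()) / s.std()

def standardize_cst(df: pd.DataFrame) -> pd.DataFrame:
    """Z-score math CST scores: by grade up to grade 7, by grade and test type from grade 8."""
    df = df.copy()
    df["cst_z"] = float("nan")
    low = df["grade"] <= 7
    # Grades 7 and below: standardize within grade using the district-wide mean and SD
    # (one could also group by year if each annual administration is standardized separately).
    df.loc[low, "cst_z"] = df.loc[low].groupby("grade")["cst_raw"].transform(zscore)
    # Grade 8 and above: standardize within grade and CST test type (e.g. Algebra 1, Geometry).
    df.loc[~low, "cst_z"] = (df.loc[~low]
                               .groupby(["grade", "test_type"])["cst_raw"]
                               .transform(zscore))
    return df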

4. Empirical strategy

We estimate two different models, and choose the first model below based on specification tests that we describe later. In our preferred specification, we model the current CST score as a function of the student's prior year CST score, student and family characteristics, teacher qualifications, fixed effects for school, year and type of test, and an error term. The district-mandated MDTP testing took place in May or June of the academic year, while the CST is offered in March or April. The basic model is the following:

Y_{icgst} = \sigma_s + \alpha Y_{icgs,t-1} + \beta\, MDTP_{i,t-1} + \gamma\, Student_{it} + \delta\, ClsSize_{icst} + \rho\, School_{ist} + \mu\, Teacher_{icst} + \pi\, Grade_{it} + \tau\, Year_t + \eta\, Test_{it} + \varepsilon_{icgst}    (1)

We refer to this as the Lagged Dependent Variable (LDV) model. Here, Y_{icgst} is the standardized CST score for student i in (math) classroom c, grade g, school s, in year t. \sigma_s is a set of school fixed effects and Y_{icgs,t-1} is the lagged math test score. MDTP_{i,t-1} is a binary variable indicating whether the individual student took the test as part of the district mandate, rather than whether the student was in a mandated cohort. As a robustness check, we later report the intent-to-treat effect as well, replacing the MDTP dummy variable with an indicator for whether students in the given student's grade and year were mandated to take the MDTP test.


The coefficient on lagged MDTP, β, is of primary interest in this study. Because test scores are available for grades 2–11, and we are modeling achievement through grade 8 for the aforementioned reasons, our main models cover grades 3–8.7

The other control variables are as follows. The vector Student contains dummies for each student's gender, ethnicity, parental education and whether the individual student was an English learner. ClsSize is the annual average math class size for each student. School is a vector of time-varying school characteristics, consisting of the percentage of students on free lunch; the percentages who were Asian, White, Hispanic, and African American; and the percentage who were English Learner students. These are constructed by averaging variables for each school in a given year across semesters. We also include various teacher characteristics. The Teacher vector includes the proportion of teachers for each student in a given year across semesters who have credentials (intern, emergency credential, full credential); math subject authorization (full, supplemental, board resolution, limited assignment emergency); years of teaching experience (overall, in district, in math and in district in math); a Master's degree; a Bachelor's in math; the Cross-cultural Language and Academic Development credential (CLAD); and indicators for teachers who are black, Asian, Hispanic, other nonwhite and female. Grade consists of dummy variables for each grade level. Year dummies are included to account for any changes in the educational system by which all students in the district are affected. Lastly, we control for the type of test taken by adding Test dummies for the version of the math CST test that the student took in the given year: Algebra 1, Algebra 2, Math 8/9, Geometry, High School Math, and Integrated Math.8

As a second model, we estimate a "gain score" version of Eq. (1) that models changes in test scores. That is, we relate the outcome variable, the change in the standardized CST score between years t − 1 and t, to the indicator of whether a student took the MDTP in year t − 1, as expressed in Eq. (2):

Y_{icgst} - Y_{icgs,t-1} = \sigma_s + \beta\, MDTP_{i,t-1} + \gamma\, Student_{it} + \delta\, ClsSize_{icst} + \rho\, School_{ist} + \mu\, Teacher_{icst} + \pi\, Grade_{it} + \tau\, Year_t + \eta\, Test_{it} + \varepsilon_{icgst}    (2)

This model identifies the effect of MDTP from changes in test-score gains for an individual student across years (and grades). It is a restricted version of (1) that assumes no depreciation of past achievement, in that it can be obtained by restricting α = 1 in (1). We show the main results for these two models below to demonstrate robustness. But we will also argue that these results suggest that model (1) is the best characterization of the data.

5. Results

5.1. The effects of MDTP taking on test scores

Table 4 shows the main results. The first column is based on the LDV model, while the second column shows the result from a "gain score" model. Both models yield a positive and significant coefficient on lagged MDTP testing, with a similar predicted gain in math achievement the following year of around a tenth of a standard deviation.

7 The exams may become path dependent starting in grade 8 rather than grade 9. Our main results are robust when we omit grade 8 students. 8 The latter three math tests are relevant only for the higher-grade sample (which includes grade 9 and above), not for our main sample of students up to grade 8. We report some results including the higher grades in the appendix in Tables A2 and B1.


Table 4
Two different models, using samples up to grade 8.

Dependent variable:    (1) Math            (2) Diff in Math
MDTP t − 1             0.089∗∗∗ (0.029)    0.117∗∗∗ (0.035)
Math score t − 1       0.750∗∗∗ (0.011)
Observations           242,765             242,765

Note: The results are for the LDV model (column 1) and the gain score model without student fixed effects (column 2). Standard errors are clustered by grade∗year. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01.

Table 5
Testing the parallel trends assumption for model selection.

Dependent variable:    (1) Math            (2) Diff in Math
MDTP t + 2             0.080 (0.049)       0.091 (0.054)
MDTP t + 1             0.032 (0.045)       0.025 (0.058)
MDTP t − 1             0.084∗∗∗ (0.029)    0.110∗∗∗ (0.035)
Math t − 1             0.750∗∗∗ (0.011)
Observations           242,765             242,765

Note: The results are for the LDV model (column 1) and the gain score model without student fixed effects (column 2). Standard errors are clustered at the grade∗year level. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01.

Table 4 also provides evidence that the gain score model does not capture the dynamics of achievement accurately. Recall that the gain score model (2) is a restricted version of (1) in which α = 1. A test of this restriction in (1) strongly rejects the null. On this basis we prefer model (1) to the gain score model (2).

With difference-in-differences models, a crucial identification assumption is the parallel trends assumption, which here requires similar trends in test scores between the grades with and without mandated testing. In our more complex difference-in-differences setting, with multiple treatment periods that vary across grades, we can test whether mandated testing one and two years in the future in a given grade is estimated to influence current-year test-score gains. Although any such finding could be due to random variation, it would raise doubts that the parallel trends assumption is valid in the given model. Table 5 shows the results of these models, using specifications (1) and (2). Our preferred specification (1) and the simple gain score model (2) both pass this test. The estimated coefficient on lagged MDTP in model (1) remains significant, and the introduction of mandated testing in a given grade one and two years in the future has no effect on achievement gains.9 Overall, we find that models (1) and (2) pass the parallel trends test, but that between the two, model (1) is preferred because the restriction implied by (2) is strongly rejected.10 Our remaining results are based on Eq. (1).
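For concreteness, the sketch below shows one way (not the authors' code) to estimate Eqs. (1) and (2) with school fixed effects and grade-by-year clustered standard errors, to test the α = 1 restriction, and to run the lead check reported in Table 5. The data frame df and its column names (cst_z, cst_z_lag, mdtp_lag, mdtp_lead1, mdtp_lead2, grade_year) are hypothetical placeholders for the prepared student-year panel, and the full set of student, school and teacher covariates is abbreviated to a comment.

import statsmodels.formula.api as smf

CONTROLS = "C(school) + C(grade) + C(year) + C(test_type)"  # plus the student/teacher covariates

def fit(formula, df):
    """OLS with standard errors clustered at the grade-by-year level."""
    return smf.ols(formula, data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["grade_year"]})

def run_specification_checks(df):
    # Eq. (1): lagged dependent variable (LDV) model.
    ldv = fit(f"cst_z ~ mdtp_lag + cst_z_lag + {CONTROLS}", df)
    # Eq. (2): gain-score model, i.e. Eq. (1) with alpha restricted to 1.
    gain = fit(f"I(cst_z - cst_z_lag) ~ mdtp_lag + {CONTROLS}", df)
    # Test of the restriction alpha = 1 implied by the gain-score model.
    restriction = ldv.t_test("cst_z_lag = 1")
    # Lead ("parallel trends") check: future mandated testing should not
    # predict current achievement conditional on the other controls.
    leads = fit(f"cst_z ~ mdtp_lag + mdtp_lead1 + mdtp_lead2 + cst_z_lag + {CONTROLS}", df)
    return ldv, gain, restriction, leads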

9 Another potential variant is the gain-score model (in which the dependent variable is the change in math achievement) with student fixed effects. We check the validity of this model as well, but it does not pass the test: MDTP introduction one year in the future is not a significant predictor of achievement gains, but introduction two years in the future is. We speculate that the known misspecification in this model (forcing the implied coefficient on lagged achievement to 1) and the addition of student fixed effects may introduce some bias (which is possible if the strict exogeneity assumption does not hold). 10 In addition, some recent evidence suggests that the gain-score model does not perform as well as the LDV model under a variety of estimation conditions. See Koedel et al. (2015) for a review.

Table 6
Heterogeneous effect of MDTP testing across initial math achievement.

                (1) Low             (2) Medium         (3) High
MDTP t − 1      0.064∗∗∗ (0.016)    0.074∗∗ (0.027)    0.080 (0.047)
Observations    80,302              80,123             82,340

Notes: The results are based on the LDV model. The dependent variable is the math CST test score, standardized by grade and test. All regressions include grade, year, CST subject type and school fixed effects. Other regressors are as listed in the description of Eq. (1) in the text. Standard errors are clustered at the grade∗year level and shown in parentheses. The low, medium, and high groups indicate students in the bottom, middle and top third of the previous year's test score distribution, respectively. The difference in the coefficient estimate on MDTP t − 1 across the three groups is not statistically significant at the 10% level. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01.

To examine whether the effect of MDTP testing varies by initial achievement, we divide students into three equal groups based on the prior grade's standardized CST score, namely low, medium and high math achievement. Table 6 shows the results. The coefficient on mandated testing is similar across the three models, although the estimate for the high group is imprecise.11 Thus, we do not find large or significant variations in the impact by students' initial achievement.

The results reported so far are based on the sample of students up to grade 8, which addresses the potential concern that high school students take different math tests depending on the variety of math courses taken in high school. In Appendix B, we replicate the main results (reported in Table 4) using the fuller sample, including grades 3–11, and discuss the results. The MDTP effect becomes slightly larger, and in all cases the effect is statistically significant at the 1% level.

5.2. Identification issues

5.2.1. Sensitivity checks that vary the covariates or the sample

We show in Table 7 that our main results are robust to the following concerns (column 1 shows the benchmark result from the LDV model). First, one might be concerned that teacher characteristics and class size are endogenous. But the estimates are virtually unaffected after dropping these variables (column 2). Second, we have controlled for the subject matter of the test taken this year, but it might be more appropriate to control for both the type of CST test taken in the current year and the type of test taken in the previous year. This decreases the MDTP coefficient slightly (from 0.089 to 0.075), but it remains significant at the 1% level (column 3). Column 4 controls for school-by-grade fixed effects as well as school-by-test-type fixed effects. These fixed effects could account for systematic within-school differences in test gains that might arise from unobserved instructional practices or school-specific tracking policies. Column 5 reports the estimate controlling for prior-test-type fixed effects, school-by-grade fixed effects, and school-by-test-type fixed effects. A separate concern is that in one of the year/grade combinations subject to mandated testing, the percentage of students tested was quite low: as Table 2 shows, in 2004–2005 only 23% of the grade 6 cohort took the MDTP. Therefore, as another robustness check, we dropped observations pertaining to the 2004–05 6th grade class, and report the result in column 6 of Table 7. A final possibility is that, to a lesser extent, grade 8 math scores suffer from the same issue as scores for higher grades, with students taking different tests depending on the level of the math class in which they enroll.

11 In a pooled version of these models, we cannot reject (at the 10% significance level) the null hypothesis that the coefficients are identical.


Table 7
Robustness checks.

                (1)               (2)               (3)               (4)               (5)               (6)               (7)
MDTP t − 1      0.089∗∗∗ (0.029)  0.090∗∗∗ (0.029)  0.075∗∗∗ (0.026)  0.088∗∗∗ (0.027)  0.070∗∗∗ (0.022)  0.091∗∗∗ (0.028)  0.086∗∗ (0.036)
Observations    242,765           242,765           242,765           242,765           242,765           234,934           204,183

Note: The results are based on the LDV model. The dependent variable is the math CST test score, standardized by grade and test. All regressions include grade, year, CST subject type and school fixed effects. Other regressors are as listed in the description of Eq. (1) in the text. Column 1 shows the benchmark model. Column 2 reports the estimate without teacher characteristics and class size, and column 3 shows the result after controlling for fixed effects for the types of the prior and current math tests. Column 4 controls for school-by-grade fixed effects as well as school-by-test-type fixed effects. Column 5 reports the estimate controlling for prior-test-type fixed effects, school-by-grade fixed effects, and school-by-test-type fixed effects. Column 6 uses the sample after dropping observations pertaining to the 2004–05 6th grade class. Finally, column 7 drops observations pertaining to the 8th grade class. Standard errors are clustered at the grade-year level. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01.

(As Table 3 shows, the sixth most common course-taking sequence, accounting for 3.7% of students, had students take the Geometry test in grade 8. In all the other popular sequences, students took the Algebra 1 test in grade 8.) Column 7 reports the result after dropping observations pertaining to the 8th grade class. Mandated testing remains positive and significant at the 5% level, with qualitatively similar results across models.

5.2.2. Compliance with the mandated testing

Another potential identification issue is related to the fact that the average compliance rate for the cohorts affected by the mandate is about 0.7, reasonably high but not complete (Table 2). Given that MDTP in the district became mandatory for certain cohorts, a selection issue is less likely than if students had been volunteered for testing by their schools. However, we are concerned that students who took the MDTP might differ in certain unobserved ways, thus biasing the estimated effects. We investigate this issue as follows. We first check whether there is any systematic difference in the observable characteristics of students who did and did not take the diagnostic test, using the following linear probability model:

Pr(MDTP_{it} = 1 \mid Mandate_{it} = 1) = \sigma_s + \delta Y_{i,t} + \gamma\, Family_{it} + \pi\, ClsSize_{icst} + \theta\, School_{ist} + \eta\, Teacher_{icst} + \mu\, Grade_{it} + \tau\, Year_t + \rho\, Test_{it} + \varepsilon_{icgst}    (3)

Thus, we regress the binary indicator of MDTP-taking on a set of school dummies, the standardized CST score of the current academic year (which is administered before the MDTP testing), and the other characteristics included in Eq. (1), while also conditioning on year, grade, and test type fixed effects. School fixed effects are jointly significant, indicating that schools differ in their compliance rates. However, to the extent that unobserved differences across schools are constant over the four-year period covered by our panel, we have fully controlled for variations in compliance rates by including school fixed effects in our test-score models. The coefficient on the CST score, δ, indicates whether there is dynamic selectivity bias akin to Ashenfelter's Dip, whereby a drop in test performance might later induce the school to give the student the MDTP test (Ashenfelter, 1978). The CST score in March of the current academic year does not predict MDTP take-up in May or June of the same year (p-value > 0.1). Similarly, last year's CST score does not predict whether the student takes the MDTP in May or June (p-value > 0.1). This finding suggests that schools are not endogenously choosing the students to whom they give MDTP tests based on students' academic performance.12
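A sketch of the take-up regression in Eq. (3), restricted to mandated student-years, is given below; the data frame and column names (mdtp, mandate, cst_z, grade_year) are hypothetical, and the family, class size and teacher controls are abbreviated to a comment.

import statsmodels.formula.api as smf

def compliance_check(df):
    """Linear probability model of MDTP take-up among mandated student-years (Eq. (3))."""
    mandated = df[df["mandate"] == 1]
    lpm = smf.ols(
        "mdtp ~ cst_z + C(school) + C(grade) + C(year) + C(test_type)",  # plus family, class size, teacher controls
        data=mandated,
    ).fit(cov_type="cluster", cov_kwds={"groups": mandated["grade_year"]})
    # delta: does the March CST score predict May/June MDTP take-up?
    return lpm.params["cst_z"], lpm.pvalues["cst_z"]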

12 Other factors may also explain the incomplete compliance. For instance, total enrollment data might include students who stayed in the district only temporarily, for as little as 90 days. If these students were in the district at the time of the CST (in March) but not at the time of the MDTP (in May or June), they would appear as non-compliers in our dataset. This failure to take up the test is of less concern once we account for student fixed effects, as a student's decision on when to move is unlikely to be correlated with the timing of the MDTP.

Table 8
IV, intention to treat, and falsification test.

                (1) Baseline        (2) IV              (3) Intent to treat   (4) Reading
MDTP t − 1      0.089∗∗∗ (0.029)    0.118∗∗∗ (0.040)                          0.006 (0.010)
Mandate t − 1                                           0.102∗∗∗ (0.035)
Observations    242,765             242,765             242,765               239,897

Note: The results are based on the LDV model. The baseline estimate is from Table 4, column 1. The IV estimate is obtained by using Mandate t − 1 as an instrument for MDTP t − 1. The first-stage F-statistic is 844.4 and the corresponding p-value on the exclusion of Mandate t − 1 is smaller than 0.001. The dependent variable for columns (1)–(3) is the standardized CST math test score. The dependent variable for column (4) is the standardized CST reading test score. All regressions include the lagged dependent variable, and grade, year, CST subject type and school fixed effects. Other regressors are as listed in the description of Eq. (1) in the text. Standard errors are clustered at the grade∗year level and shown in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01.

5.2.3. Instrumental variable and intent-to-treat models, and a placebo test

As a further check, if there is selectivity bias in terms of who was tested based on unobserved characteristics beyond those investigated above, we can reduce this potential endogeneity by instrumenting MDTP t−1 with a dummy variable set to one if, in the student's current grade and year, the district had mandated that students be administered the MDTP.13 Table 8 reports the results. We also obtain intent-to-treat estimates, in which we include the instrument itself in place of the MDTP dummy. Both the IV estimates (column 2) and the intent-to-treat estimates (column 3) are close to the baseline estimates and highly statistically significant. The IV estimate is slightly higher than the baseline estimate, suggesting that, if anything, the schools that complied most strongly, and the students within schools who were most likely to be given the test, had weaker test-score gains quite independent of MDTP testing. In the last column of Table 8, as a placebo test, we run the same model as Eq. (1) but using the standardized reading score as the dependent variable. Since the MDTP is intended to improve math performance, we should see a weaker effect of the MDTP on the reading score.

13 The instrumental variable (IV) readily passes weak instrument tests. The first-stage F-statistic is 844.4 and the corresponding p-value on the exclusion of Mandate t − 1 is smaller than 0.001. As shown in Table 1, in 2004 MDTP was mandated only at selected schools for grade 6 students, and we treat these students as a non-mandated cohort.


Confirming our expectation, the coefficient estimate on MDTP t−1 drops sharply and is no longer statistically significant.14
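The Table 8 specifications could be estimated along the following lines, using the linearmodels package for the 2SLS column; the column names are the same hypothetical placeholders as above, and read_z/read_z_lag stand in for the standardized reading score and its lag.

import statsmodels.formula.api as smf
from linearmodels.iv import IV2SLS

def table8_sketch(df):
    controls = "C(school) + C(grade) + C(year) + C(test_type)"
    grp = {"groups": df["grade_year"]}

    # Column 2: 2SLS, instrumenting lagged MDTP taking with the lagged district mandate.
    iv = IV2SLS.from_formula(
        f"cst_z ~ 1 + cst_z_lag + {controls} + [mdtp_lag ~ mandate_lag]", data=df
    ).fit(cov_type="clustered", clusters=df["grade_year"])

    # Column 3: intent to treat, replacing MDTP taking with the mandate indicator itself.
    itt = smf.ols(f"cst_z ~ mandate_lag + cst_z_lag + {controls}", data=df).fit(
        cov_type="cluster", cov_kwds=grp)

    # Column 4: placebo test, the same model with the reading score as the outcome.
    placebo = smf.ols(f"read_z ~ mdtp_lag + read_z_lag + {controls}", data=df).fit(
        cov_type="cluster", cov_kwds=grp)
    return iv, itt, placebo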


6. Potential explanations for the positive effect of mandated MDTP testing

In the following section, we present survey evidence that teachers used mandated test results both at the end of the year and during the following school year to adjust their teaching for individual students, which is the intended action resulting from diagnostic testing. Before presenting the survey evidence, however, we first examine two other potential causal mechanisms: the use of MDTP to increase the probability that lower-achieving students enroll in and benefit from summer school, and more accurate placement of students into ability groups the year after they are tested. As shown below, we find little evidence in favor of either of these two hypotheses. With our extended sample (through grade 11 rather than grade 8, as discussed in the appendix), we find that these two mechanisms cannot explain the estimated MDTP effect either, in that the inclusion of controls for these mechanisms does not lower the coefficient on MDTP in the math achievement model.

The district used overall student performance during the year, in addition to MDTP performance, to make recommendations to parents on whether their children should attend summer school.15 Does MDTP in fact influence summer school assignment? More importantly, can summer school attendance explain any of the estimated MDTP effect? Second, the district used letter grades, teacher recommendations and MDTP scores to make decisions on the level of math class into which a student would be placed in the subsequent fall of middle or high school. MDTP could thus influence achievement through its impact on the degree of ability grouping or peer effects. Given the large literature suggesting that peers and ability grouping may affect student achievement (see Sacerdote, 2011 and Betts, 2011 for reviews), some of the positive effects of MDTP may be driven through these channels if differential placement takes place due to MDTP. Can we detect any evidence that this in fact took place, and that it influenced student achievement?

Table 9 shows linear probability models of the probability that students attended summer school, on the corresponding sample of secondary school students. Samples are restricted to grades 7–8 only, grades for which summer school data are available, and exclude data from grades 3 to 6. For this smaller set of grades, we make inferences using the wild bootstrap procedure described in Cameron et al. (2008).
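Because the grades 7–8 subsample contains only a small number of grade-by-year clusters, conventional cluster-robust inference can be unreliable. The sketch below is a bare-bones wild cluster bootstrap of the t-statistic for a single coefficient (Rademacher weights, null imposed), intended only to illustrate the idea in Cameron et al. (2008); it is a simplified stand-in, not the authors' implementation.

import numpy as np
import statsmodels.api as sm

def wild_cluster_boot_p(y, X, clusters, k, reps=999, seed=0):
    """Wild cluster bootstrap p-value for H0: beta_k = 0.

    y: (n,) outcome; X: (n, p) regressors including a constant;
    clusters: (n,) cluster ids; k: index of the tested coefficient.
    """
    rng = np.random.default_rng(seed)
    ids = np.unique(clusters)

    def cluster_t(y_, X_):
        res = sm.OLS(y_, X_).fit(cov_type="cluster", cov_kwds={"groups": clusters})
        return res.tvalues[k]

    t_obs = cluster_t(y, X)

    # Restricted fit with the tested regressor dropped (imposes beta_k = 0).
    X_r = np.delete(X, k, axis=1)
    res_r = sm.OLS(y, X_r).fit()
    fitted, resid = res_r.fittedvalues, res_r.resid

    t_boot = np.empty(reps)
    for b in range(reps):
        w = rng.choice([-1.0, 1.0], size=ids.size)    # one Rademacher draw per cluster
        flip = w[np.searchsorted(ids, clusters)]      # map draws back to observations
        y_star = fitted + resid * flip
        t_boot[b] = cluster_t(y_star, X)

    return np.mean(np.abs(t_boot) >= abs(t_obs))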

14 Finally, we explore the tests suggested in Pei et al. (2016). The authors argue that, instead of comparing the coefficient on the intervention with and without additional controls, a more powerful approach is to regress each explanatory variable on the intervention. If the intervention predicts the explanatory variables, and if there is evidence that these variables in turn influence the outcome of interest, it could indicate omitted variable bias. Accordingly, we check the coefficient on lagged mandated MDTP (our instrument) in 24 models, each with one of our additional explanatory variables as the dependent variable, after controlling for fixed effects for year, grade, test type, and school. Out of 24 time-varying characteristics, mandated testing is a statistically significant predictor at the five percent level for five variables. This would raise a concern if these variables were also significant predictors of test scores, conditional on the fixed effects. (The authors emphasize that their test makes sense only if there is a strong prior that the covariate affects the outcome of interest, namely math achievement.) MDTP is a significant predictor in only one such case, an indicator for whether the student is an English Learner. We note that Pei et al. (2016) indicate a problem with their approach: their test tends to over-reject if there is more than one explanatory variable (we have 24), and this over-rejection becomes a more serious problem if robust standard errors are calculated, as is the case in our work. Overall, the results provide only weak evidence that omitted variable bias is an issue. 15 Using a randomized field trial, Borman and Dowling (2006) find that attending summer school promotes longitudinal achievement growth. Jacob and Lefgren (2004) and Matsudaira (2008) use a regression discontinuity design to estimate the causal effect of summer school and find positive effects of summer schooling.

Table 9
Does MDTP affect enrollment in summer school?

                                  (1)              (2)                  (3)
MDTP t − 1                        0.027 (0.012)    0.028∗ (0.011)       0.030∗ (0.012)
CST score t − 1                                    −0.034∗∗∗ (0.007)    −0.027∗∗∗ (0.009)
MDTP t − 1 × CST score t − 1                                            −0.018 (0.013)
Observations                      78,971           78,971               78,971

Note: The dependent variable is an indicator for attending summer school. All regressions include grade, year, CST subject type and school fixed effects. Other regressors are as listed in the description of Eq. (1) in the text. Standard errors are clustered at the grade∗year level. Inferences are based on the wild bootstrap procedure described in Cameron et al. (2008). Samples are restricted to grades 7–8, grades for which summer school data are available. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01.

Table 10
The effect of mandated MDTP testing and summer school on math achievement.

                        (1)               (2)              (3)               (4)
Summer                  −0.019 (0.012)                     −0.022 (0.012)    −0.020 (0.016)
MDTP t − 1                                0.064∗ (0.023)   0.065∗ (0.023)    0.065∗∗ (0.023)
MDTP t − 1 × Summer                                                          −0.004 (0.024)
Observations            77,806            77,806           77,806            77,806

Note: The dependent variable is the math CST test score, standardized by grade and test. All regressions include grade, year, CST subject type and school fixed effects. Other regressors are as listed in the description of Eq. (1) in the text. Standard errors are clustered at the grade∗year level. Inferences are based on the wild bootstrap procedure described in Cameron et al. (2008). Samples are restricted to grades 7–8, grades for which summer school data are available (the sample is smaller than in Table 9 since the dependent variable here requires two consecutive years of test scores). ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01.

In each of the columns, we condition on whether the student took the mandated MDTP in the prior year. The MDTP coefficient is roughly 0.03. Column 2 shows that the math CST score in the prior year is negatively associated with attendance at summer school. Schools made recommendations on who should attend summer school before receiving these test scores, but the scores serve as a noisy measure of achievement, which should be positively correlated with teachers' own assessments of student achievement. As expected, weaker students were more likely to attend summer school. Column 3 tests the key idea that MDTP testing would increase the probability that weaker students would be sent to summer school; the interaction between MDTP testing and that spring's CST score is negative but not statistically significant.

These findings naturally lead to the next questions: do students who attend summer school gain more in math achievement by the following spring, and can any summer school effects at least partly explain the positive estimated effect of MDTP on subsequent test-score gains? Table 10 models the math CST test score as a function of the lagged test score, summer school attendance, and having taken the MDTP the prior spring. Column 1 suggests that summer school is associated with a roughly 0.02 standard deviation decrease in math achievement by the following spring (not statistically significant). However, a note of caution here is that we do not have a natural experimental setting that can be exploited to estimate the causal effect of summer schooling. Thus our summer school coefficient likely picks up negative selection into summer school (i.e., those who do not perform well attend summer school, as suggested by the negative coefficient on the CST test score in Table 9), rather than the causal effect of summer schooling. After replicating our basic model on this subsample in column 2 of Table 10, in column 3 we include both lagged MDTP and summer school.

Table 11
Can the effect of MDTP on math achievement be explained by resulting changes in the standard deviation of initial achievement by classroom and peer quality?

Dependent variable:   (1) SD          (2) Peer         (3) Math           (4) Math            (5) Math
MDTP t − 1            0.008 (0.005)   −0.035 (0.023)   0.135∗∗∗ (0.036)   0.133∗∗∗ (0.035)    0.142∗∗ (0.037)
SD                                                                        0.153∗∗∗ (0.028)    0.002 (0.030)
Peer                                                                                          0.207∗∗∗ (0.018)
Observations          115,032         115,032          115,032            115,032             115,032

Note: The dependent variable in column 1 is the standard deviation of lagged math achievement of classroom peers. The dependent variable in column 2 is the average lagged CST score of the current classmates, excluding the student him/herself. The dependent variable for columns 3–5 is the math CST test score, standardized by grade and test. All regressions include grade, year, CST subject type and school fixed effects. Column 5 additionally controls for the percentage of peers taking each type of test and a dummy for own test type. Other regressors are as listed in the description of Eq. (1) in the text. Inferences are based on the wild bootstrap procedure described in Cameron et al. (2008). Standard errors are clustered by grade∗year and shown in parentheses. Samples include grades 6–8. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01.

Compared to column 2, the coefficient on lagged MDTP does not change much. Thus, the effect of taking the MDTP on summer school placement does not seem to explain the effect of mandated MDTP. In column 4 we interact lagged MDTP and summer school, but the interaction is not significant. Again, for both the summer school dummy and its interaction with MDTP, causal inference may not be warranted, as selection into summer school could bias our inferences. The main take-away from this exercise is that our MDTP coefficient does not change much and remains positive.

Next, we examine whether the MDTP test affects placement in the following academic year (samples are restricted to grades 6–8, grades for which placement may start to matter). If mandated MDTP testing leads to better placement, then we should expect the variance of test scores within the set of students taking a given class to fall if the students were given the MDTP in the prior spring. The first empirical test we conduct is to regress the standard deviation of the prior spring's standardized math test score for each current-year class on an indicator for whether a student took the MDTP the prior year, along with the other control variables listed in the basic specification. The coefficient estimate suggests that taking the MDTP the prior year reduces the within-class standard deviation by 0.034 (p < 0.01), but only when we include the upper grades (grades 6–11) in the sample (Appendix Table A2, column 1). When we use our main sample of grades 6–8, the classroom-level standard deviation of initial test scores is not affected (column 1 in Table 11). This is not surprising, given that tracking seems to start in grade 9. Similarly, peer quality (as measured by the average lagged test score of classmates, excluding the student him/herself) is not influenced by MDTP taking. In the remaining columns of Table 11, we check whether controlling for the dispersion of lagged test scores and peer quality within the current-year class makes the MDTP effect shrink. The coefficient on lagged MDTP does not change much and remains statistically significant at the 5% level when the measure of within-course/class dispersion and the measure of peer quality are included. In Table A2, we report the same table as Table 11 but including the upper grades, where sorting could matter to a greater extent. We find a similar pattern; that is, controlling for class dispersion and peer quality, the effect of MDTP does not go away and if anything rises slightly. Thus, it appears that the MDTP effect is not driven by an increased degree of sorting.
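The two sorting measures used in Table 11 can be constructed mechanically from a student-by-classroom panel; the sketch below, with hypothetical column names, computes the within-class standard deviation of lagged scores and the leave-one-out peer mean of lagged scores.

import pandas as pd

def add_sorting_measures(df: pd.DataFrame) -> pd.DataFrame:
    """Add the within-class SD and the leave-one-out peer mean of lagged scores."""
    df = df.copy()
    g = df.groupby(["year", "school", "class_id"])["cst_z_lag"]
    # Dispersion of prior-year achievement within the current class.
    df["class_sd_lag"] = g.transform("std")
    # Leave-one-out peer mean: (class sum - own score) / (class size - 1).
    class_sum = g.transform("sum")
    class_n = g.transform("count")
    df["peer_mean_lag"] = (class_sum - df["cst_z_lag"]) / (class_n - 1)
    return df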


7. Survey-based evidence of how mandated testing improved learning

What can account for the unexplained portion of the estimated impact of mandated MDTP testing? Bachofer, Zau and Betts (2012) conducted a survey of middle and high school math teachers in the SDUSD, and asked them whether and how they made use of the results of mandated MDTP tests. The results are striking. Fully 93% of teachers who administered MDTP tests under the district mandate reported making use of the results during that same school year, and 75% of teachers reported making use of mandated test results to assist students in the following school year.16 Within the year of the test, 75% of math teachers stated that they "reviewed results on my own to determine overall strengths and weaknesses". Students and their families also received considerable feedback. For instance, 47% of teachers discussed the MDTP results with students in their classes. Teachers also made considerable use of the student letters generated by the MDTP, which are intended to inform students and their parents of students' specific weaknesses and strengths. Almost half (45%) of teachers reported distributing the MDTP student letters to students. Remarkably, given that the MDTP tests were given within one or two months of the end of the school year, 39% of math teachers reported that they "modified teaching to help students understand and correct misunderstandings and errors revealed by (the) test" before the end of the school year.

Given the fairly large impact of the mandated testing, and given how late in the school year the tests took place, we should expect to see that during the school year after the test, math teachers would use the MDTP tests from the prior spring to help their current students. Bachofer et al. (2012) indeed find that teachers made use of these tests in ways that are very much aligned with the stated goals of the MDTP. Three quarters of math teachers reported that they used the previous spring's MDTP results in some way during the current school year. Well over half (58.5%) reported that they "reviewed results on my own to determine overall strengths and weaknesses". Perhaps most importantly, 58.5% stated that they "modified teaching to help students understand and correct misunderstandings and errors revealed by (the) test". Roughly a third of math teachers reported discussing the prior spring's results at a formal meeting of the school's math department and spending additional time working on areas in which their students performed poorly on the prior spring's MDTP test. We cannot prove that these factors account for the quite large achievement gains in the year following a mandated math test, but it seems clear from these survey results that math teachers actively used the test results to help their students improve in areas the tests revealed as weaknesses.

8. Conclusions

We find that district-mandated MDTP testing is associated with positive gains in mathematics achievement during the next year. The one-year effect sizes of roughly 0.1 in our main model are meaningful. Gains of 0.1 standard deviations translate into gains of 4 percentile points in a single year for a student initially at the 50th percentile, and 3 percentile points for a student at the 25th percentile. As a point of comparison, the effect size of MDTP testing is similar to that of a one standard deviation increase in teacher quality (as measured by teacher fixed effects), which is found to increase student test scores by 0.1 standard deviations (Rockoff, 2004).
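As a quick check on the percentile arithmetic, the snippet below converts a 0.1 standard deviation gain into percentile-point gains under the assumption that standardized scores are approximately normal; it is a back-of-the-envelope illustration rather than a calculation taken from the paper.

```python
# Back-of-the-envelope check: translate a 0.1 SD gain into percentile points,
# assuming standardized scores are approximately standard normal.
from scipy.stats import norm

GAIN_SD = 0.10
for start in (0.50, 0.25):                 # initial percentile
    z0 = norm.ppf(start)                   # initial score in SD units
    new = norm.cdf(z0 + GAIN_SD)           # percentile after the gain
    print(f"{start:.0%} -> {new:.1%}  (+{100 * (new - start):.1f} points)")
# Prints roughly +4.0 points from the 50th percentile and +2.8 (about 3) from the 25th.
```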

16 Results cited here and below come from Table 3 of Bachofer et al. (2012).


The effect is meaningful but slightly smaller than the finding of Dee and Jacob (2011), who estimate an effect size of 0.14 in grade 8 math for the implementation of NCLB, including both its testing and accountability provisions. These authors find a bigger effect size of 0.23 in grade 4 math for the implementation of NCLB. Their latter finding for elementary schools is on the same order of magnitude as the estimated impact of reducing elementary school class size from roughly 22–25 students to 13–17 students in the STAR experiment, which is 0.27 for math and 0.23 for reading (Finn and Achilles, 1990). Note, however, the significant costs of reducing class size compared to the much smaller costs of a full-blown accountability system, and the yet less expensive cost of the MDTP’s 45-minute test, offered free throughout California and online outside California for a nominal fee.

The MDTP tests were designed to help teachers identify students’ specific strengths and weaknesses so that teachers can promptly provide remedial help to each student to overcome the identified misunderstandings. In this paper, we explore two alternative mechanisms that may explain the MDTP effect, summer school and class placement, and find that they are not the main drivers of the positive effect of diagnostic testing. What can account for the rest of the effect? Based on a survey they conducted of middle and high school math teachers in the district, Bachofer et al. (2012) find that teachers made use of the MDTP test results in numerous ways in the year of the test and, significantly, in the following year. For instance, teachers reacted to test results by speaking to students in their classes about the results, sending home with students the MDTP letters that detail students’ results, spending time to understand their classes’ overall strengths and weaknesses and, significantly, modifying their teaching to pay more attention to student misunderstandings revealed on the prior spring’s diagnostic tests. Teachers reported actively using the test results in the year of the test and even in the next grade to understand their students’ strengths and weaknesses and to adjust their teaching to address these weaknesses. These actions on the part of teachers could account for much of the observed impact of district-wide diagnostic testing.17

We believe that the context in which diagnostic testing is introduced is likely to mediate the impact of the testing. In their experimental study conducted in India, Muralidharan and Sundararaman (2010) find no effect of giving teachers test scores, but they point out that if the government of Andhra Pradesh adopted such a policy, the effects could be quite different. In the context we study, math teachers came to know that every student entering given grades had detailed diagnostic results available from the prior spring, and they appear to have used these data actively to boost the achievement of their students.18

Our finding seems timely given that most U.S. states have recently adopted one of two testing systems (Smarter Balanced and PARCC) that both aim to provide not only year-end “summative” tests, but also more diagnostic “formative” assessments. These latter tests can be used by teachers throughout the school year to learn which skills students have mastered and which they have yet to truly learn. In this regard, these formative tests are similar to the MDTP. Our MDTP results do not necessarily apply to these tests. However, they offer important hope that diagnostic tests, when used systematically by a school district, can accelerate achievement.

17 Another possible channel for the positive MDTP effect is that students may work to improve their weaknesses in mathematics after the MDTP test. This is possible because students, not only teachers, receive detailed reports after the MDTP test. For instance, students may not be formally enrolled in summer school but may have engaged in other activities that promote learning in the subsequent academic year. These activities, however, are unobserved in the data, so we could not explore this hypothesis further.
18 There are other important differences between the context of that study and ours. Schools in the study in India did not use the tests to place students. A second important difference is that in our case the central district administration embraced usage of the diagnostic testing rather than receiving the tests as part of an outside intervention. A third key distinction is that in our sample teachers and schools were subject to an accountability system and were incentivized to help boost test scores.

Acknowledgments

This research was funded at arm’s length by the California Academic Partnership Program (CAPP) at the California State University and the University of California. We thank Karen Bachofer, Julie Cullen, David Figlio, Yixiao Sun, Hee-Seung Yang, Ron Rode, Franciska Szucs, and Peter Bell, as well as participants in seminars at the annual meetings of the American Economic Association, UC San Diego, the CAPP Advisory Committee meeting, and the Asian Conference on Applied Micro-Economics/Econometrics at Sogang University for helpful suggestions. For their help answering our many questions about the MDTP we especially acknowledge the assistance provided by Alfred Manaster, emeritus state director of the MDTP, Bruce Arnold, emeritus state director of MDTP, Donna Ames of the MDTP San Diego office, and David Jolly and Andrea Ball, emeritus and current directors of CAPP, respectively. We thank Hans Johnson, Richard Murnane, and Russell Rumberger for providing detailed comments on an earlier draft of a non-technical version of this paper. We are indebted to the referees and editor for excellent suggestions.

Appendix A

Table A1
Summary statistics.

Variable                                      Mean     Std. dev.   Min   Max
MDTP(t−1)                                     0.13     0.33        0     1

School demographics and characteristics
Average math class size                       10.15    13.12       0     45
% of school Asian                             16.03    14.54       0     100
% of school white                             24.94    21.45       0     100
% of school Hispanic                          43.94    24.31       0     100
% of school black                             13.75    11.36       0     98.41
% of school free lunch                        60.86    28.55       0     100
% of school English Learner                   22.17    21.84       0     100

Teacher characteristics in math class (or in homeroom in elementary self-contained classrooms), averaged across semesters
Average years teaching at SDUSD               10.18    7.59        0     42
Average years teaching                        11.72    8.27        0     45
Avg. SDUSD years service in math classes      4.97     8.36        0     39
Average years service in math classes         4.20     7.31        0     38
Average percent bachelors in math             6.34     24.09       0     100
Average any Master's degree                   20.72    40.30       0     100
Avg. full credential among math teachers      40.42    49.03       0     100
Average teacher intern                        0.36     5.67        0     100
Average full authorization in math            11.25    31.20       0     100
Avg. supplemental authorization in math       14.17    34.51       0     100
Average CLAD certificate                      20.78    40.22       0     100
Average female teacher                        25.61    43.25       0     100
Average of white teachers                     30.70    45.91       0     100
Average of black teachers                     2.13     14.12       0     100
Average of Asian teachers                     4.59     20.62       0     100
Average of Hispanic teachers                  2.61     15.82       0     100

Student characteristics
English learner                               0.27     0.44        0     1
Female                                        0.49     0.50        0     1
White                                         0.24     0.43        0     1
Black/African American                        0.14     0.34        0     1
Asian                                         0.17     0.37        0     1
Hispanic                                      0.45     0.50        0     1
Other race                                    0.01     0.09        0     1

Parental education
Less than high school                         0.17     0.38        0     1
High school                                   0.19     0.39        0     1
Some college                                  0.16     0.37        0     1
College graduate                              0.17     0.37        0     1
Postgraduate college                          0.10     0.30        0     1
Missing                                       0.21     0.41        0     1

Note: N = 242,765.

Table A2
Can the effect of MDTP on math achievement be explained by resulting changes in the standard deviation of initial achievement by classroom and peer quality?

Dependent variable:   (1) SD               (2) Peer           (3) Math            (4) Math            (5) Math
MDTP(t−1)             −0.034*** (0.012)    −0.025 (0.018)     0.154*** (0.030)    0.159*** (0.031)    0.162*** (0.031)
SD                                                                                0.149*** (0.036)    −0.003 (0.040)
Peer                                                                                                  0.238*** (0.013)
Observations          210,622              212,201            210,622             210,622             210,622

Note: The dependent variable in column 1 is the standard deviation of lagged math achievement of classroom peers. The dependent variable in column 2 is the average lagged CST score of the current classmates, excluding the student him/herself. The dependent variable in columns 3–5 is the math CST test score, standardized by grade and test. All regressions include grade, year, CST subject type, and school fixed effects. Column 5 additionally controls for the percentage of peers taking each type of test and a dummy for own test type. Other regressors are as listed in the description of Eq. (1) in the text. Standard errors are clustered by grade × year and shown in parentheses. Samples include grades 6–11. * p < 0.10, ** p < 0.05, *** p < 0.01.

Appendix B

Our results reported in the main text are based on a sample of students up to grade 8. This is to address the potential concern that high school students take a different set of math tests, depending on the variety of math courses taken in high school.

Table B1
The effect of MDTP using alternative models and samples.

Dependent variable:   (1) Math            (2) Diff in math    (3) Diff in math    (4) Diff in math
MDTP(t−1)             0.146*** (0.031)    0.210*** (0.033)    0.154*** (0.036)    0.112*** (0.015)
Math score(t−1)       0.659*** (0.020)                                            0.379***,+ (0.039)
Observations          342,571             342,571             342,571             311,415
Student FE            No                  Yes                 No                  Removed by first differencing

Note: A sample of students in grades 3 through 11 is used. Column 1 reports the result for the LDV model. Columns 2 and 3 report the results for the gain score model, with and without student fixed effects, respectively. Column 4 reports the result based on the first-differenced version of Eq. (1) in the text. Standard errors are clustered at the grade × year level. * p < 0.10, ** p < 0.05, *** p < 0.01. + Coefficient on lagged test score gains.


In this appendix, we report the results using the fuller sample, including grades 3–11. The first three models are: (1) the lagged dependent variable (LDV) model (i.e., the unrestricted value-added model, in which we regress the math test score on the lagged math test score and the MDTP variables); (2) the gain score model (which models changes in achievement) with student fixed effects; and (3) the gain score model without student fixed effects. In addition to the three models discussed in the text, as a fourth model we report the result of a more general approach, which is to model the level of test scores as a function of the lagged test score and other explanatory variables while allowing for student fixed effects. We adopt the approach of Anderson and Hsiao (1981), which allows for a lagged dependent variable while accounting for the student fixed effects.19 In all cases, the coefficient on MDTP(t−1) is statistically significant at the 1% level. The magnitude of the effect differs across models, with the gain score model with student fixed effects showing a slightly larger MDTP effect (column 2) compared to the other models. The Anderson and Hsiao model (column 4) yields smaller but qualitatively similar results.

19 To estimate the Anderson and Hsiao (1981) model, we take the first difference of the following model:
$Y_{icgst} = \delta Y_{icgs,t-1} + \alpha_i + \sigma_s + \beta MDTP_{i,t-1} + \gamma Family_{it} + \rho ClsSize_{icst} + \pi School_{ist} + \eta Teacher_{icst} + \eta Grade_{it} + \tau Year_t + \theta Test_{it} + \varepsilon_{icgst},$
and instrument $Y_{icgs,t-1}$ using $Y_{icgs,t-2}$. Note that we do not present this model using our main sample, which includes grades 3 to 8, as the model requires a twice-lagged test score variable and we have too few grades of data to do so.
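As a rough illustration of how this fourth model could be estimated, the sketch below first-differences the score equation and instruments the differenced lagged score with the twice-lagged score level, as described in the footnote above. It assumes a long-format student panel with hypothetical column names and uses the linearmodels package as one possible IV implementation; the other differenced controls from Eq. (1) are omitted for brevity, and this is not the authors' code.

```python
# Sketch of the Anderson-Hsiao (1981) estimator: first-difference out the
# student fixed effect, then instrument the differenced lagged score with the
# twice-lagged score level. Column names are hypothetical placeholders.
import pandas as pd
from linearmodels.iv import IV2SLS

def anderson_hsiao(df: pd.DataFrame) -> float:
    df = df.sort_values(["student_id", "year"]).copy()
    g = df.groupby("student_id")

    df["d_score"] = g["math_score"].diff()                        # Y_t - Y_{t-1}
    df["d_score_lag"] = g["math_score"].shift(1) - g["math_score"].shift(2)
    df["score_lag2"] = g["math_score"].shift(2)                   # instrument
    df["d_mdtp_lag"] = g["mdtp_lag"].diff()                       # differenced MDTP dummy

    est_df = df.dropna(subset=["d_score", "d_score_lag", "score_lag2", "d_mdtp_lag"])
    res = IV2SLS.from_formula(
        "d_score ~ 1 + d_mdtp_lag + [d_score_lag ~ score_lag2]",
        data=est_df,
    ).fit(cov_type="clustered", clusters=est_df["grade_year"])
    return res.params["d_mdtp_lag"]
```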

References

Anderson, T.W., Hsiao, C., 1981. Estimation of dynamic models with error components. J. Am. Stat. Assoc. 76 (375), 598–606.
Ashenfelter, O., 1978. Estimating the effect of training programs on earnings. Rev. Econ. Stat. 60 (1), 47–57.
Bachofer, K.V., Zau, A.C., Betts, J.R., 2012. The Impact of the Use of the Mathematics Diagnostic Testing Project in San Diego Unified School District: Teacher Survey Component. California Academic Partnership Program, Long Beach, CA.
Betts, J.R., 2011. The economics of tracking in education. In: Hanushek, E.A., Machin, S., Woessmann, L. (Eds.), Handbook of the Economics of Education, Volume 3. North Holland, Amsterdam, pp. 341–381.
Betts, J.R., Zau, A., Rice, L., 2003. Determinants of Student Achievement: New Evidence from San Diego. Public Policy Institute of California, San Francisco.
Black, P., Wiliam, D., 1998. Assessment and classroom learning. Assess. Educ. 5 (1), 7–74.
Borman, G.D., Maritza Dowling, N., 2006. Longitudinal achievement effects of multiyear summer school: evidence from the Teach Baltimore randomized field trial. Educ. Eval. Policy Anal. 28, 25–48.
Cameron, A., Gelbach, J.B., Miller, D.L., 2008. Bootstrap-based improvements for inference with clustered errors. Rev. Econ. Stat. 90 (3), 414–427.
Collins, A.M., Cameron, A., Gawronski, J., Eich, M., McCready, S., 2011. Using Classroom Assessment to Improve Student Learning. National Council of Teachers of Mathematics, Reston, VA.
Craig, S.G., Imberman, S.A., Perdue, A., 2013. Does it pay to get an A? School resource allocation in response to accountability ratings. J. Urban Econ. 73, 30–42.
Dee, T., Jacob, B., 2011. The impact of No Child Left Behind on student achievement. J. Policy Anal. Manage. 30 (3), 418–446.
De Fraja, G., Martinez-Mora, F., 2014. The desegregating effect of school tracking. J. Urban Econ. 80, 164–177.
Duflo, E., Dupas, P., Kremer, M., 2011. Peer effects, teacher incentives, and the impact of tracking: evidence from a randomized evaluation in Kenya. Am. Econ. Rev. 101 (5), 1739–1774.
Finn, J.D., Achilles, C.M., 1990. Answers and questions about class size: a statewide experiment. Am. Educ. Res. J. 27 (3), 557–577.
Hamilton, L.S., Stecher, B.M., Marsh, J.A., Sloan McCombs, J., Robyn, A., 2007. Standards-Based Accountability Under No Child Left Behind: Experiences of Teachers and Administrators in Three States. RAND Corporation.
Hanushek, E.A., Kain, J.F., Markman, J.M., Rivkin, S.G., 2003. Does peer ability affect student achievement? J. Appl. Econ. 18, 527–544.
Jacob, B.A., Lefgren, L., 2004. Remedial education and student achievement: a regression-discontinuity analysis. Rev. Econ. Stat. 86 (1), 226–244.
Koedel, C., Mihaly, K., Rockoff, J.E., 2015. Value-added modeling: a review. Econ. Educ. Rev. 47, 180–195.
Lefgren, L., 2004. Educational peer effects and the Chicago public schools. J. Urban Econ. 56 (2), 169–191.
Matsudaira, J.D., 2008. Mandatory summer school and student achievement. J. Econom. 142 (2), 829–850.
Muralidharan, K., Sundararaman, V., 2010. The impact of diagnostic feedback to teachers on student learning: experimental evidence from India. Econ. J. 120 (546), F187–F203.
Pei, Z., Pischke, J.-S., Schwandt, H., 2016. Poorly Measured Confounders are More Useful on the Left than on the Right. London School of Economics Working Paper.
Rockoff, J., 2004. The impact of individual teachers on student achievement: evidence from panel data. Am. Econ. Rev., Papers Proc. 94, 247–252.
Sacerdote, B., 2011. Peer effects in education: how might they work, how big are they and how much do we know thus far? In: Hanushek, E.A., Machin, S., Woessmann, L. (Eds.), Handbook of the Economics of Education, Volume 3, pp. 249–277.
