
“Value added” measures of teacher quality: use and policy validity

Sean P. Corcoran, New York University
NYU Abu Dhabi Conference, January 22, 2009

Overview

 An introduction to the use of “value added” measures (VAM) of teacher effectiveness – in both research and practice
 A discussion of the policy validity of VAM – motivated by current work on “teacher effects” on multiple assessments of similar skills
With:
 Jennifer L. Jennings (Columbia U)
 Andrew A. Beveridge (Queens College)

What are “value added” measures?

 Essentially, an indirect estimate of a teacher’s contribution to learning, measured using gains in students’ standardized test scores
 What makes them “indirect”?
   A statistical model accounts for certain student characteristics (key: past achievement), attributing the remaining test score gains to the teacher (a typical specification is sketched below)
   Clearly an improvement over test score levels
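
For reference, a typical value-added specification looks something like the following (the notation is my illustration; the talk does not commit to a particular model):

```latex
% Illustrative value-added regression. A_{it} is student i's test score
% in year t, X_{it} observed student characteristics, and
% \theta_{j(i,t)} the "teacher effect" of i's teacher j, estimated
% after conditioning on prior achievement.
\[
  A_{it} = \lambda A_{i,t-1} + X_{it}'\beta + \theta_{j(i,t)} + \varepsilon_{it}
\]
```

Whatever gain remains after conditioning on prior achievement and student characteristics is attributed to the teacher.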

What are “value added” measures?

 Generally, “teacher effects” cannot be separated from “classroom effects”
   E.g., two classrooms of similarly situated students where one has a particularly disruptive student
 May be able to improve VAM with multiple years of results for teachers
   This approach raises a range of additional issues and questions, some of which I will address in a moment

Growth in VAM

 VAM of teacher effectiveness were initially mostly of academic interest
   Rivkin et al. (2005): effect size of .10/.11 SD for reading/math
   Nye et al. (2004): a 25th-to-75th percentile shift in teacher quality increased reading/math by .35/.48 SD

Growth in VAM

 Value added assessment of teachers is becoming widespread practice in the U.S.

 Houston, Dallas, Denver, Minneapolis, Charlotte
   E.g., EVAAS (Education Value-Added Assessment System)
 New York City – for now a “development tool” only
   The Teacher Data Tool Kit

Why the sudden interest?

1. A logical extension of school accountability
• Movement to collect and publicly report student achievement measures at the school level
• In some cases, rewards and sanctions (e.g., NCLB)
• Common-sense appeal (both Obama and McCain supported “pay for performance” for teachers)

Why the sudden interest?

2. Data availability
• Large longitudinal databases of student performance enabled these calculations
• Concurrent advancements in methodology

Why the sudden interest?

3. Improving our assessment and measurement of teacher quality
• Easily observed characteristics of teachers are often poor predictors of classroom achievement (Hanushek and Rivkin 2006)
• Especially true of qualifications for which teachers are remunerated (e.g., education, certification, experience)

Issues with VAM (to name a few…)

1. Focus on a narrow measure of educational outcomes: does “the test” adequately reflect our expectations of the educational system?
 E.g., skill content, short-term vs. long-term benefits

2. Validity: assuming “the test” reflects outcomes we care about, is the instrument a valid one?
 Teaching to the test and test inflation (Koretz 2007) – even “good” tests lose validity over time

Issues with VAM (to name a few…)

3. Modeling for causal inference: how can we be confident that our VAM provide “good” estimates of the teacher’s true (i.e., causal) contribution to student learning?
 Students are not randomly assigned to teachers
 Dynamic tracking
 “Teacher effects” may be context-dependent

Issues with VAM (to name a few…)

4. Precision
 Estimates of teacher effects are just that: estimates
 Each student’s test score gain is a small (and noisy) indicator of teacher effectiveness
 Are our estimates precise enough to base personnel decisions on them? (A sketch of the issue follows below.)
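
To make the precision concern concrete: HLM-style estimators, like those in the figures later in the talk, report reliability-weighted (empirical Bayes) estimates, in which a teacher’s raw mean residual is shrunk toward zero in proportion to its noisiness. A sketch of the standard shrinkage formula (my addition; the talk does not derive it):

```latex
% Empirical Bayes shrinkage of a raw teacher-level mean residual
% \bar{\theta}_j estimated from n_j students. \sigma^2_\theta is the
% variance of true teacher effects, \sigma^2_\varepsilon the
% student-level noise variance; the reliability weight lies in (0,1).
\[
  \hat{\theta}^{\,\mathrm{EB}}_j
    = \frac{\sigma^2_\theta}{\sigma^2_\theta + \sigma^2_\varepsilon / n_j}
      \, \bar{\theta}_j
\]
```

With a single year of data and a typical class size, the reliability weight can sit well below one – which is the sense in which one-year estimates are noisy.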

Issues with VAM (to name a few…)

5. Other
 Perverse incentives (gaming / cheating)
 Subject dependency
 Persistence
 Scaling issues – e.g., ceiling effects
 Missing data – e.g., absent or exempted students

The “policy validity” of VAM

 Do VAM of teacher effectiveness have “policy validity”? That is, are they appropriate for practical implementation, and for what purposes? (Harris 2007)
 If one were to make personnel decisions based on VAM, at the very least these measures should be:
   Convincing as “causal” estimates
   Relatively precise

Our research question

 If VAM are meaningful indicators of teacher effectiveness, they should be relatively consistent across alternative assessments of the same skills (especially for narrowly defined skills)
 In most cases we only observe one assessment – the “high stakes” state assessment – upon which teacher effects are estimated

Houston

 Houston is unusual in that one can observe two measures of student achievement:
   TAKS – a “high stakes” exam
   Stanford 10 – a “low stakes” exam
   Both tests cover reading and math skills
 How consistent are VAM of effectiveness on these two tests?

Houston data and method

 Longitudinal student-level data on all students in the Houston ISD, 1998–2006 (we use 2003–06)
   Students are linked to their teachers
   Student background
   About 127,000 students
 We estimate teacher effects for 4th and 5th grade teachers on both the TAKS and Stanford tests, using 1 and 3 years of results (a code sketch follows below)
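
A minimal sketch of this kind of estimation on simulated data (the column names, the single prior-score control, and the data are my illustration; the study’s actual models are richer):

```python
# Illustrative value-added estimation on fake data: regress each test's
# score on prior achievement plus teacher dummies, then compare the
# estimated teacher fixed effects across the two tests.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_teachers, n_per_class = 50, 20

df = pd.DataFrame({
    "teacher_id": np.repeat(np.arange(n_teachers), n_per_class),
    "prior_score": rng.normal(size=n_teachers * n_per_class),
})
true_effect = rng.normal(scale=0.2, size=n_teachers)  # "true" teacher quality
for test in ("taks_score", "stanford_score"):
    df[test] = (0.7 * df["prior_score"]
                + true_effect[df["teacher_id"].to_numpy()]
                + rng.normal(scale=0.5, size=len(df)))

def teacher_effects(outcome: str) -> pd.Series:
    """Centered teacher fixed effects from a value-added regression."""
    fit = smf.ols(f"{outcome} ~ prior_score + C(teacher_id) - 1", data=df).fit()
    fe = fit.params.filter(like="C(teacher_id)")
    return fe - fe.mean()  # relative to the average teacher

taks_fe = teacher_effects("taks_score")
stanford_fe = teacher_effects("stanford_score")
print("cross-test correlation:", np.corrcoef(taks_fe, stanford_fe)[0, 1])
```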

Correlation across tests

 Low- and high-stakes reading: correlation coefficient 0.34
 Low- and high-stakes mathematics: correlation coefficient 0.41


Teacher effects on multiple tests

[Figure: Teacher Fixed Effects on TAKS and Stanford Math Tests – percent of teachers in each low-stakes (Stanford) quintile, by high-stakes (TAKS) quintile]
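
The quintile charts here and on the following slides can be read as row-normalized cross-tabulations: among teachers in a given high-stakes quintile, what percentage land in each low-stakes quintile? A standalone sketch of that computation (simulated data; the degree of consistency shown is arbitrary, not the study’s):

```python
# Simulate two noisy measures of the same underlying teacher quality,
# bucket each into quintiles, and cross-tabulate as in the charts.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
quality = rng.normal(size=500)  # hypothetical true teacher quality
effects = pd.DataFrame({
    "high_stakes": quality + rng.normal(size=500),  # e.g., TAKS-based estimate
    "low_stakes": quality + rng.normal(size=500),   # e.g., Stanford-based estimate
})
effects["hs_q"] = pd.qcut(effects["high_stakes"], 5, labels=list(range(1, 6)))
effects["ls_q"] = pd.qcut(effects["low_stakes"], 5, labels=list(range(1, 6)))

# Percent of teachers in each low-stakes quintile, by high-stakes
# quintile; perfect consistency would put 100s on the diagonal.
xtab = pd.crosstab(effects["hs_q"], effects["ls_q"], normalize="index") * 100
print(xtab.round(1))
```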

Teacher effects on multiple tests (one year of data only)

[Figure: HLM Estimates of 4th and 5th Grade Teacher Effectiveness on TAKS and Stanford Reading Tests, 2006 – percent of teachers in each low-stakes quintile, by high-stakes quintile]

Teacher effects on multiple subjects

[Figure: HLM Estimates of 4th and 5th Grade Teacher Effectiveness on TAKS Reading and Math Tests, 2006 – percent of teachers in each reading quintile, by math quintile]


Teacher effect stability

[Figure: HLM Estimates of 4th and 5th Grade Teacher Effectiveness on TAKS Reading Tests, 2005 and 2006 – percent of teachers in each 2005 quintile, by 2006 quintile]

Conclusions

 Teachers who are good at promoting growth on a high-stakes test are not necessarily those who are good at promoting growth on a low-stakes test of the same subject
 Teacher effects vary significantly across years and subjects
 Useful for policy? Probably – but we should resist relying too heavily on these measures
 Of course, more research is needed!