“Believe me, it’s not cheating, but some strange method”

C R E S S T / Harvard
“Believe me, it’s not cheating,
but some strange method”
GRE/TOEFL prep teacher, Shanghai
Daniel Koretz
Harvard Graduate School of Education
National Center for Research on Evaluation, Standards, and Student Testing
Annual CRESST Conference
September 11, 2002, Los Angeles, CA
Validating inferences in the age of NCLB

Validity is a property of inferences, not of
measures

Key inferences are now about gains obtained
under high stakes conditions

Traditional validation is insufficient
    Inappropriate framework
    Insufficient methods

Risk is false positives: inflated gains
Map of talk

Will not show evidence of severe inflation—old
hat by now

Will discuss approach to validation of gains

Will illustrate possible leverage points for
coaching, inflation of scores

Will note possible directions for future research
Why traditional validation is insufficient

Cross-sectional, insensitive to changes in levels
of performance

Insufficient in high-stakes contexts:
    Largely ignores behavioral responses to testing
    Ignores inadvertent emphases in tests
    Assumes stability in relationships between aspects of performance, both tested and untested
Why these limitations matter

Scores can rise rapidly—and be inflated—
without affecting correlations among tests

Behavioral responses to testing (e.g., coaching)
can make sampled content unrepresentative of
domain after initial validation

Inadvertent emphases in tests can provide
leverage for coaching
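
The first point on this slide can be checked with a small numeric sketch. The scores below are invented for illustration and are not KIRIS or ACT data. Adding a uniform shift to one test's scores raises its mean but leaves its Pearson correlation with a second test unchanged, so a stable correlation by itself says nothing about whether a gain is meaningful.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for six students (illustrative values only).
high_stakes_year1 = [40.0, 45.0, 50.0, 55.0, 60.0, 65.0]
audit_test = [18.0, 21.0, 19.0, 24.0, 23.0, 27.0]

# Assumption: coaching adds 8 points to every high-stakes score
# while performance on the audit test does not move.
high_stakes_year2 = [s + 8.0 for s in high_stakes_year1]

r_before = pearson(high_stakes_year1, audit_test)
r_after = pearson(high_stakes_year2, audit_test)
mean_gain = (sum(high_stakes_year2) - sum(high_stakes_year1)) / len(high_stakes_year1)

print(f"correlation before: {r_before:.3f}")
print(f"correlation after:  {r_after:.3f}")
print(f"mean gain on high-stakes test: {mean_gain:.1f}")
```

A uniform shift is the simplest case; the same invariance holds for any positive linear rescaling of the scores.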
KY math trends, KIRIS and ACT
[Figure: line plot of KIRIS and ACT mathematics trends; y-axis: Standard Deviation, -0.1 to 0.7; x-axis: Year, 1992 to 1995]
Correlations Between ACT and KIRIS Mathematics

Year    Student level    School level
1992    0.54             0.69
1993    0.71             0.75
1994    0.70             0.58
1995    0.72             0.74
Keys to validating gains

Assess generalizability of gains to other (audit)
measures

Determine how much generalizability should be expected
    Based on users’ inferences (example of TAAS vs. NAEP)

Examine behavioral responses
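
One simple way to put a number on the first key, offered as an illustrative sketch and not as the method used in the CRESST work: express each test's gain in baseline standard deviation units and compare the two. The function names and every value below are hypothetical; they are not TAAS, NAEP, KIRIS, or ACT results.

```python
def standardized_gain(mean_start, mean_end, sd_start):
    """Gain expressed in baseline standard deviation units."""
    return (mean_end - mean_start) / sd_start

def generalization_ratio(gain_high_stakes, gain_audit):
    """Share of the high-stakes gain that also appears on the audit test."""
    if gain_high_stakes == 0:
        raise ValueError("no gain on the high-stakes test to generalize")
    return gain_audit / gain_high_stakes

# Hypothetical means and baseline SDs for a high-stakes test and an audit test.
gain_hs = standardized_gain(mean_start=50.0, mean_end=56.0, sd_start=10.0)
gain_audit = standardized_gain(mean_start=20.0, mean_end=20.5, sd_start=5.0)
ratio = generalization_ratio(gain_hs, gain_audit)

print(f"high-stakes gain: {gain_hs:.2f} SD")
print(f"audit gain:       {gain_audit:.2f} SD")
print(f"generalization ratio: {ratio:.2f}")
```

How large a ratio should be expected is not a statistical question alone; it depends on the inference users draw from the scores, which is the second key above.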
CRESST work on the validation of gains

Develop framework for validation efforts (Tech
Report 551)

Explore teacher surveys and interviews as a
means of obtaining information on behavioral
responses to testing (ongoing)

Develop statistical models for the analysis of
gains (new)
Framework for validating gains

Identify substantive and nonsubstantive
performance elements (PEs) in the test and in the inferences

Determine weights given to PEs in test

May be unintended

May be trivial or zero

Determine weights given to PEs in key
inferences about gains

Validity hinges on consistency of change in
performance on PEs with inference weights
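
The last point can be made concrete with a small sketch. It assumes, as a simplification for illustration, that a score gain is a weighted average of gains on individual performance elements. The element names, gains, and weights are all hypothetical. The observed gain uses the weights the test gives each element; the warranted gain uses the weights implied by the inference.

```python
def weighted_gain(gains, weights):
    """Weighted average of per-element gains; weights are normalized to sum to 1."""
    total = sum(weights.values())
    return sum(gains[pe] * w for pe, w in weights.items()) / total

# Hypothetical gains (in SD units) on four performance elements.
gains = {"algebra": 0.5, "geometry": 0.4, "data_analysis": 0.0, "item_format": 0.8}

# Hypothetical weights the test gives each element.
# "item_format" stands in for a nonsubstantive element.
test_weights = {"algebra": 0.4, "geometry": 0.3, "data_analysis": 0.1, "item_format": 0.2}

# Hypothetical weights implied by the inference "students know more mathematics".
inference_weights = {"algebra": 0.35, "geometry": 0.3, "data_analysis": 0.35, "item_format": 0.0}

observed = weighted_gain(gains, test_weights)
warranted = weighted_gain(gains, inference_weights)
inflation = observed - warranted

print(f"gain as scored by the test:     {observed:.3f}")
print(f"gain relevant to the inference: {warranted:.3f}")
print(f"difference (possible inflation): {inflation:.3f}")
```

In this sketch the difference is positive because the largest gains fall on elements the test weights more heavily than the inference does.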
Types of test preparation

Teaching more

Working harder

Working more effectively

Reallocation

Alignment

Coaching

Cheating
Reallocation

Refers to shifting limited instructional resources
among substantive areas

Within subject

Between subjects

Results in reallocating achievement

Can lead to either meaningful change or inflation

Inflates by undermining representation of the
domain
Alignment

Sometimes presented as providing protection
against inflation: emphasis on PEs deemed
important

But this is just a form of reallocation

Whether gains are inflated depends on
    Importance of emphasized material to inference, and
    Importance of de-emphasized or omitted material to inference
Coaching

Focuses on details of the test
    Substantive, including item style
    Non-substantive, such as item formats and scoring rubrics

Includes test-taking tricks, e.g., process of elimination (POE) and plugging in

Can inflate scores or simply waste time
Possible levers for coaching

Possibly inadvertent content overweighting

Item style

Recurrent content detail

Recurrent form of presentation

Inadvertent, recurrent construct
underrepresentation

Recurrent cognitive demand with limited
construct relevance
Item from G8 MCAS
Eva has four sets of straws. The
measurements of the straws are given
below. Which set of straws could not be
used to form a triangle?
A. Set 1: 4 cm, 4 cm, 7 cm
B. Set 2: 2 cm, 3 cm, 8 cm
C. Set 3: 3 cm, 4 cm, 5 cm
D. Set 4: 5 cm, 12 cm, 13 cm
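
For reference, the item turns on the triangle inequality: three lengths form a triangle only if the two shorter ones sum to more than the longest. A minimal check of the four sets, with a helper name chosen here for illustration:

```python
def can_form_triangle(a, b, c):
    """Triangle inequality: the two shorter sides must sum to more than the longest."""
    x, y, z = sorted((a, b, c))
    return x + y > z

# Straw lengths in cm, as given in the item.
straw_sets = {
    "Set 1": (4, 4, 7),
    "Set 2": (2, 3, 8),
    "Set 3": (3, 4, 5),
    "Set 4": (5, 12, 13),
}

impossible = [name for name, sides in straw_sets.items()
              if not can_form_triangle(*sides)]
print(impossible)
```

Only Set 2 fails the check, since 2 + 3 is less than 8.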
Item from G8 MCAS
Each arrangement in this pattern is made up of tiles.
How many tiles will be in the 6th arrangement in the
pattern?
Prompt from G8 MCAS
Use the balance scales below to answer the question below
Prompt from G10 NAEP
Use the unit of length below to estimate the perimeter of
the figure shown. Between which two consecutive
whole-number units does the perimeter lie?
Prompt from G10 MCAS
Use the map below to answer this question.
Prompt from a G8 KIRIS item
Prompt from G10 MCAS
Use the figure below to answer the next question
Answers for G10 MCAS prompt
If the figure above is
folded into a cube,
which of the following
solids will be formed?
Next steps for research

Develop methods for ascertaining which levers
teachers use to inflate scores

Develop methods for identifying systematically
the patterns in tests that facilitate or inhibit
coaching and inappropriate reallocation

Develop methods for ‘unpacking’ lack of
generalization and for better distinguishing
between meaningful gains and inflation