Inside the black box: Raising standards through classroom


The reliability of educational
assessments
Dylan Wiliam
www.dylanwiliam.net
Ofqual Annual Lecture, Coventry: 7 May 2009
The argument
The public understanding of the reliability of assessments is weak
Contributory factors are
The need of humans for certainty (and beliefs that its absence is chaos)
The inherent unreliability of all measurements, educational and otherwise
The use in education of tools derived from individual differences psychology
An emphasis on scores, rather than how they are used
Political assumptions about the educability of the public, combined with
A desire to use assessment outcomes as drivers of reform
Those who produce—and those who mandate the use of—assessments
must take responsibility for informed use of assessment outcomes
Dealing with uncertainty in society
People like certainty…
Hilbert (1900): “In mathematics, there is no ignorabimus”
He was wrong
And it was unsettling (Kline, 1980)
…and to attribute blame…
Deaths of children in care (e.g., “Baby P.”)
…although there are some cases where uncertainty is accepted
“It is better and more satisfactory to acquit a thousand guilty persons than to
put a single innocent one to death” (Maimonides)
“It is better that ten guilty persons escape than that one innocent suffer”
(Blackstone)
The very first high-stakes assessment…
“Then Jephthah gathered together all the men of Gilead, and fought with
Ephraim: and the men of Gilead smote Ephraim, because they said, Ye
Gileadites are fugitives of Ephraim among the Ephraimites, and among the
Manassites.
And the Gileadites took the passages of Jordan before the Ephraimites: and it
was so, that when those Ephraimites which were escaped said, Let me go
over; that the men of Gilead said unto him, Art thou an Ephraimite? If he said,
Nay;
Then said they unto him, Say now Shibboleth: and he said Sibboleth: for he
could not frame to pronounce it right. Then they took him, and slew him at the
passages of Jordan: and there fell at that time of the Ephraimites forty and
two thousand.” (Judges 12, 4-6, King James version)
Reliability
Hansen (1993) distinguishes between literal and representational
assessments
There are no literal assessments
All assessments are representational
All assessments involve generalization
Reliability is a measure of the stability of assessment outcomes under
changes in—or the ability to generalize across—things that (we think)
shouldn’t make a difference, such as:
marker/rater
occasion*
item selection*
* UK excepted
Uncertainty in assessing English
Starch & Elliott (1912)
Uncertainty in assessing mathematics
Starch & Elliott (1913)
Measures of reliability
In classical test theory, reliability is defined as a kind of “signal-to-noise” ratio (in fact a signal to signal-plus-noise ratio)
Reliability is increased
by decreasing the noise, or, easier,
by increasing the signal
Hence the need for discrimination
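The signal to signal-plus-noise ratio can be illustrated with a small simulation. This is a minimal sketch, not the lecture's own material: it assumes normally distributed true scores and errors, and the function names (`theoretical_reliability`, `simulated_reliability`) are hypothetical. It shows that, with the noise held fixed, spreading students further apart (more signal) raises the reliability.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def theoretical_reliability(signal_sd, noise_sd):
    """Signal variance divided by signal-plus-noise variance."""
    return signal_sd ** 2 / (signal_sd ** 2 + noise_sd ** 2)


def simulated_reliability(signal_sd, noise_sd, n_students=20000, seed=0):
    """Correlation between two parallel forms that share the same true scores."""
    rng = random.Random(seed)
    # Assumption: true scores and errors are independent and normal.
    true_scores = [rng.gauss(0, signal_sd) for _ in range(n_students)]
    form_a = [t + rng.gauss(0, noise_sd) for t in true_scores]
    form_b = [t + rng.gauss(0, noise_sd) for t in true_scores]
    return pearson(form_a, form_b)


if __name__ == "__main__":
    # Noise is held at 1.0 while the spread of true scores grows.
    for signal_sd in (1.0, 2.0, 3.0):
        print(
            signal_sd,
            round(theoretical_reliability(signal_sd, 1.0), 3),
            round(simulated_reliability(signal_sd, 1.0), 3),
        )
```

With a noise standard deviation of 1, a signal standard deviation of 3 gives a ratio of 9/10, so the same test looks more reliable when given to a more varied group.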
The legacy of individual differences psychology
A focus on discrimination between individuals
In education, more appropriate ways of estimating reliability exist
Discriminating between those who have and have not been taught
Discriminating between those who have and have not been taught well
Test length and reliability
Just about the only way to increase
the reliability of a test is to make it
longer, or narrower (which amounts
to the same thing).
To \ From   0.70   0.75   0.80   0.85   0.90   0.95
0.70        1.0
0.75        1.3    1.0
0.80        1.7    1.3    1.0
0.85        2.4    1.9    1.4    1.0
0.90        3.9    3.0    2.3    1.6    1.0
0.95        8.1    6.3    4.8    3.4    2.1    1.0
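The figures in this table are consistent with the Spearman-Brown prophecy formula, read as the factor by which a test must be lengthened to move from one reliability to another. The slide does not name the formula, so treating it as the source of the table is an assumption, and `lengthening_factor` is a hypothetical name. A minimal sketch:

```python
def lengthening_factor(r_from, r_to):
    """Spearman-Brown: how many times longer a test must be to go from r_from to r_to.

    Assumes the added items are parallel to the existing ones.
    """
    return (r_to * (1 - r_from)) / (r_from * (1 - r_to))


def print_table(levels=(0.70, 0.75, 0.80, 0.85, 0.90, 0.95)):
    """Print the lower triangle of the From/To table, one row per target reliability."""
    for r_to in levels:
        cells = [
            f"{lengthening_factor(r_from, r_to):.2f}"
            for r_from in levels
            if r_from <= r_to
        ]
        print(f"{r_to:.2f}  " + "  ".join(cells))


if __name__ == "__main__":
    print_table()
```

For example, going from 0.70 to 0.95 gives (0.95 × 0.30) / (0.70 × 0.05), roughly 8.1, so a one-hour test would need to run for about eight hours.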
Reliability is not what we really want
Take a test which is known to have a reliability of around 0.90 for a
particular group of students.
Administer the test to the group of students and score it
Give each student a random script rather than their own
Record the scores assigned to each student
What is the reliability of the scores assigned in this way?
A. 0.10
B. 0.30
C. 0.50
D. 0.70
E. 0.90
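The thought experiment can be simulated. The sketch below assumes an internal-consistency estimate (Cronbach's alpha), which the slide does not specify, and uses made-up continuous item scores; all names are hypothetical. Alpha is computed from the set of scripts alone, so shuffling which student receives which script leaves it unchanged, even though the assigned scores no longer say anything about the students holding them.

```python
import math
import random
import statistics


def cronbach_alpha(scripts):
    """Cronbach's alpha for a list of scripts, each a list of item scores."""
    k = len(scripts[0])
    item_vars = [statistics.pvariance([s[i] for s in scripts]) for i in range(k)]
    total_var = statistics.pvariance([sum(s) for s in scripts])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def run_experiment(n_students=2000, n_items=10, noise_sd=1.054, seed=1):
    """Compare each student's own script with a randomly assigned one."""
    rng = random.Random(seed)
    abilities = [rng.gauss(0, 1) for _ in range(n_students)]
    # Assumption: item score = ability + independent noise; noise_sd is chosen
    # so that ten items give an alpha of roughly 0.90.
    own = [[a + rng.gauss(0, noise_sd) for _ in range(n_items)] for a in abilities]
    assigned = own[:]
    rng.shuffle(assigned)  # each student now holds someone else's script
    return {
        "alpha_own": cronbach_alpha(own),
        "alpha_assigned": cronbach_alpha(assigned),
        "corr_with_ability_own": pearson(abilities, [sum(s) for s in own]),
        "corr_with_ability_assigned": pearson(abilities, [sum(s) for s in assigned]),
    }


if __name__ == "__main__":
    for name, value in run_experiment().items():
        print(name, round(value, 3))
```

Under these assumptions the two alpha values agree, while the correlation between a student's ability and the assigned score falls to about zero.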
Reliability v consistency
Classical measures of reliability
are meaningful only for groups
are designed for continuous measures
Marks versus grades
Scores suffer from spurious accuracy
Grades suffer from spurious precision
Classification consistency
A more technically appropriate measure of the reliability of assessment
Closer to the intuitive meaning of reliability
Uncertainty in assessment at A-level
Classification consistency at A-level
Please, D. N. (1971) “Estimation of the proportion of examination candidates
who are wrongly graded”. Br. J. math. statist. Psychol. 24: 230-238.
Here’s the table that got me into trouble…
Classification consistency of National Curriculum Assessment in England
        Number      Reliability
        of levels   0.60   0.65   0.70   0.75   0.80   0.85   0.90   0.95
KS1     3           73%    75%    77%    79%    81%    83%    86%    90%
KS2     5           56%    58%    60%    64%    68%    73%    77%    84%
KS3     8           45%    47%    50%    54%    57%    62%    68%    76%
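The pattern in this table (consistency rising with reliability and falling as the number of levels grows) can be reproduced qualitatively with a simulation. This is only a sketch under stated assumptions: normal true scores and errors, and hypothetical cut scores that split a standard normal scale into equally populated levels. The slide does not give the cut scores or distributions behind the table, so the numbers produced here will not match it.

```python
import bisect
import math
import random
from statistics import NormalDist


def classification_consistency(reliability, n_levels, n_students=20000, seed=0):
    """Share of simulated students whose observed level matches their true-score level."""
    rng = random.Random(seed)
    # Hypothetical cut scores: equal-probability bands on a unit-variance scale.
    cuts = [NormalDist().inv_cdf(j / n_levels) for j in range(1, n_levels)]
    # Observed variance is fixed at 1, split between true score and error.
    true_sd = math.sqrt(reliability)
    error_sd = math.sqrt(1 - reliability)
    agree = 0
    for _ in range(n_students):
        true_score = rng.gauss(0, true_sd)
        observed = true_score + rng.gauss(0, error_sd)
        if bisect.bisect_right(cuts, true_score) == bisect.bisect_right(cuts, observed):
            agree += 1
    return agree / n_students


if __name__ == "__main__":
    for n_levels in (3, 5, 8):
        row = [classification_consistency(r, n_levels) for r in (0.60, 0.80, 0.95)]
        print(n_levels, [round(p, 2) for p in row])
```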
AERA, APA, NCME Standards (4th ed., 1999)
Standard 2.1
For each total score, subscore or combination of scores that is to be interpreted,
estimates of relevant reliabilities and standard errors of measurement or test
information functions should be reported (p. 31)
Standard 2.2
The standard error of measurement, both overall and conditional (if relevant) should
be reported both in raw score or original scale units and in units of each derived score
recommended for use in test interpretation (p. 31, my emphasis).
Standard 2.3
When test interpretation emphasizes differences between two observed scores of an
individual, or two averages of a group, reliability data, including standard errors,
should be provided for such differences (p. 32)
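The quantities these standards ask for can be computed from a reliability estimate using textbook classical test theory formulas. The Standards text quoted above does not itself give formulas, so the sketch below is an assumption about how they would typically be applied. The function names are hypothetical, and the difference formula assumes the two scores have independent errors.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical SEM: spread of observed scores around a fixed true score."""
    return score_sd * math.sqrt(1 - reliability)


def standard_error_of_difference(sem_a, sem_b):
    """SE of the difference between two scores, assuming independent errors."""
    return math.sqrt(sem_a ** 2 + sem_b ** 2)


def score_band(observed, score_sd, reliability, z=1.96):
    """Approximate 95% band around an observed score (z=1.96 assumes normal errors)."""
    half_width = z * standard_error_of_measurement(score_sd, reliability)
    return observed - half_width, observed + half_width


if __name__ == "__main__":
    # Hypothetical scale: standard deviation 15, reliability 0.91.
    sem = standard_error_of_measurement(15, 0.91)
    print(round(sem, 2))
    print(round(standard_error_of_difference(sem, sem), 2))
    print(tuple(round(x, 1) for x in score_band(100, 15, 0.91)))
```

On this hypothetical scale the SEM is 15 × √0.09 = 4.5 points, which shows how wide the band around a single score remains even at a reliability above 0.90.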
Conclusion
It is simply unethical to produce or to mandate the use of assessments
without taking steps to ensure that those likely to use the outcomes of the
assessments do so in an informed way.
Error bounds should be routinely estimated, and reported in terms of the
units used for reporting (e.g., grades and aggregate measures)
Government and its agencies should actively promote public understanding of
the limitations of assessments, both in terms of reliability and other
aspects of validity
appropriate interpretations of assessment outcomes, for individuals and
groups