
Reading Assessment:
Still Time for a Change
P. David Pearson
UC Berkeley
Professor and Former Dean
Slides available at www.scienceandliteracy.org
Why did I pick such a boring topic?
I’m a professor!
Who needs fun?
The consequences are too grave.
I have a perverse standard of fun.
Valencia and Pearson (1987). Reading Assessment: Time for a Change. The Reading Teacher.
A set of contrasts between cognitively oriented views of reading and prevailing practices in assessing reading circa 1986.
New views of the reading process tell us that . . .
Yet when we assess reading comprehension, we . . .

New view: Prior knowledge is an important determinant of reading comprehension.
Yet we: mask any relationship between prior knowledge and reading comprehension by using lots of short passages on lots of topics.

New view: A complete story or text has structural and topical integrity.
Yet we: use short texts that seldom approximate the structural and topical integrity of an authentic text.

New view: Inference is an essential part of the process of comprehending units as small as sentences.
Yet we: rely on literal comprehension test items.

New view: The diversity in prior knowledge across individuals, as well as the varied causal relations in human experience, invites many possible inferences to fit a text or question.
Yet we: use multiple-choice items with only one correct answer, even when many of the responses might, under certain conditions, be plausible.

New view: The ability to synthesize information from various parts of the text and from different texts is a hallmark of an expert reader.
Yet we: rarely go beyond finding the main idea of a paragraph or passage.

New view: The ability to vary reading strategies to fit the text and the situation is one hallmark of an expert reader.
Yet we: seldom assess how and when students vary the strategies they use during normal reading, studying, or when the going gets tough.

New view: The ability to ask good questions of text, as well as to answer them, is a hallmark of an expert reader.
Yet we: seldom ask students to create or select questions about a selection they may have just read.

New view: All aspects of a reader's experience, including habits that arise from school and home, influence reading comprehension.
Yet we: rarely view information on reading habits and attitudes as being as important as information about performance.

New view: Reading involves the orchestration of many skills that complement one another in a variety of ways.
Yet we: use tests that fragment reading into isolated skills and report performance on each.

New view: Skilled readers are fluent; their word identification is sufficiently automatic to allow most cognitive resources to be used for comprehension.
Yet we: rarely consider fluency as an index of skilled reading.

New view: Learning from text involves the restructuring, application, and flexible use of knowledge in new situations.
Yet we: often ask readers to respond to the text's declarative knowledge rather than to apply it to near and far transfer tasks.
What is thinking?
You do it in your head, without a pencil. Alexandra, age 4
You shouldn't do it in the dark. It's too scary. Thomas, age 5
What is Thinking?
Thinking is when you're doing math and getting the answers right. Sissy, age 5
And in response…
NO! You do the thinking when you DON’T
know the answer. Alex, age 5
What is Thinking?
It’s very, very easy. The way you do it is just
close your eyes and look inside your head.
Robert, age 4
What is Thinking?
You think before you cross the street!
What do you think about?
You think about what you would look like
smashed up! Leon, age 5
What is Thinking?
You have to think in swimming class.
About what?
About don’t drink the water because maybe
someone peed in it…and don’t drown!
Why did We Take this Stance?
Need a little mini-history of assessment to
understand our motives
The Scene in the US in the 1970s and
early 1980s
Behavioral objectives
Mastery Learning
Criterion referenced assessments
Curriculum-embedded assessments
Minimal competency tests: New Jersey
Statewide assessments: Michigan &
Minnesota
Historical relationships between instruction and assessment
Skill 1: Teach → Assess → Conclude
Skill 2: Teach → Assess → Conclude
The 1970s skills-management mentality: teach a skill, assess it for mastery, reteach it if necessary, and then go on to the next skill.
The 1970s, cont.
Foundation: Benjamin Bloom's ideas of mastery learning
Skills 1 through 6: Teach → Assess → Conclude
And we taught each of these skills until we had covered the entire curriculum for a grade level.
Dangers in the Mismatch we Saw in 1987
False sense of security.
Instructionally insensitive to progress on new curricula.
Accountability will do us in, forcing us to teach to the tests and all the bits and pieces.
Pearson’s First Law of Assessment
The finer the grain size at which we monitor a
process like reading and writing, the greater
the likelihood that we will end up teaching
and testing bits and pieces rather than global
processes like comprehension and
composition.
The ideal
The best possible assessment:
Teachers observe and interact with students as they read authentic texts for genuine purposes.
They evaluate the way in which the students construct meaning.
They intervene to provide support or suggestions when the students appear to have difficulty.
Pearson’s Second Law of Assessment
• An assessment tool is valued to the degree
that it can approximate the good judgment
of a professional teacher!
A new conceptualization of the goal
[Table: reading features (accuracy, fluency, word meaning, comprehension, critique, response) crossed with levels of decision-making (beyond school, school, classroom, individual), with cells naming the appropriate tools: norm-referenced tests (NRT), unit tests, informal reading inventories (IRI), performance tasks, essays, and discussion.]
A 1987 Agenda for the Future
Pearson’s Third Law of Assessment
When we ask an assessment to serve a
purpose for which it was not designed, it is
likely to crumble under the pressure, leading
to invalid decisions and detrimental
consequences.
Early 1990s in the USA
Standards based reform
State initiatives
IASA model
Trading flexibility for accountability
Move from being accountable for the means while leaving the ends up for grabs (the doctor or lawyer model)
TO being accountable for the ends while leaving the means up for grabs (the carpenter or product model)
Mid 1990s Developments
Assessment got situated within the standards
movement
Content Standards: Know and be able to do?
Performance Standards: What counts?
Opportunity to Learn Standards: Quid pro
quo?
Standards-Based Reform
The Initial Theory of Action
Standards + Assessment + Accountability → Clear Expectations + Motivation → Higher Student Learning
À la Tucker and Resnick in the early 1990s
Expanded Theory of Action
Standards + Assessment + Accountability → Clear Expectations + Motivation + Instruction + Professional Development → Higher Student Learning
À la Elmore and Resnick in the late 1990s
The Golden Years of the 90s?
 A flying start in the late 1980s and early 1990s
 International activity in Europe, Down Under, North
America
 Developmental Rubrics
 Performance Tasks
• New Standards
• CLAS
 Portfolios of Various Sorts
• Storage bins
• Showcase: best work
• Compliance: Walden, NYC
 Increased use of constructed-response items in NRTs
Late 1980s/early 1990s: Portfolios and Performance Assessments
Make assessment look like instruction: Activities → Conclusions on standards 1-n
We engage in instructional activities, from which we collect evidence that permits us to draw conclusions about student growth or accomplishment on several dimensions (standards) of interest.
The complexity of modern assessment practices: one to many
Activity X → Standards 1-5
Any given activity may offer evidence for many standards, e.g., responding to a story.
The complexity of performance assessment practices: many to one
Activities 1-5 → Standard X
For any given standard, there are many activities from which we could gather relevant evidence about growth and accomplishment, e.g., reads fluently.
The complexity of portfolio assessment practices: many to many
Activities 1-5 ↔ Standards 1-5
• Any given artifact/activity can provide evidence for many standards.
• Any given standard can be indexed by many different artifacts/activities.
The perils of performance assessment:
or maybe those multiple-choice assessments aren't so bad after all…
 "Thunder is a rich source of loudness."
 "Nitrogen is not found in Ireland because it is not found in a free state."
The perils of performance assessment
"Water is composed of two gins, Oxygin and
Hydrogin. Oxygin is pure gin. Hydrogin is gin
and water.”
"The tides are a fight between the Earth
and moon. All water tends towards the
moon, because there is no water in the
moon, and nature abhors a vacuum. I
forget where the sun joins in this fight."
The perils of performance assessment
"Germinate: To become a naturalized
German."
"Vacumm: A large, empty space where
the pope lives.”
Momentum is something you give a
person when they go away.
The perils of performance assessment
 The cause of perfume disappearing is
evaporation. Evaporation gets blamed for
a lot of things people forget to put the top
on.
 Mushrooms always grow in damp places
which is why they look like umbrellas.
 Genetics explains why you look like your
father, and if you don't, why you should.
The perils of performance assessment
"When you breath, you inspire. When you do
not breath, you expire."
Post 1996: The Demise of Performance Assessment
A definite retreat from performance-based assessment as a wide-scale tool
Psychometric issues
Cost issues
Labor issues
Political issues
The Remains…
Still alive inside classrooms and schools
Hybrid assessments based on the NAEP
model
multiple-choice
short answer
extended response
The persistence of standards-based
reform.
No Child Left Behind
Accountability in Spades
Every grade level reporting
Census assessment rather than sampling
(everybody takes the same test)
Disaggregated Reporting by
Income
Exceptionality
Language
Ethnicity
NCLB, continued
• Assessments for varied purposes:
  • Placement
  • Progress monitoring
  • Diagnosis
  • Outcomes/program evaluation
• Scientifically based curriculum too
There is good reason to worry about disaggregation
[Chart: average achievement in School 1 vs. School 2; height of bar = average achievement, width = number of students.]
Disaggregation and masking: Simpson's Paradox?
[Chart: achievement in School 1 and School 2, broken out by subgroup A (large N) and subgroup B (small N).]
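To see how pooled averages can mask subgroup patterns, here is a minimal sketch in Python. The enrollments and scores are invented for illustration (they are not from the talk); they are chosen so that School 2 beats School 1 within both subgroups yet trails on the pooled average, which is Simpson's Paradox.

```python
# Hypothetical numbers illustrating Simpson's Paradox in school averages.
# Each school has two subgroups, A (large N) and B (small N), stored as
# (number of students, mean achievement score).
schools = {
    "School 1": {"A": (180, 70.0), "B": (20, 40.0)},
    "School 2": {"A": (60, 72.0), "B": (140, 44.0)},
}

for name, groups in schools.items():
    total_n = sum(n for n, _ in groups.values())
    pooled = sum(n * mean for n, mean in groups.values()) / total_n
    by_group = ", ".join(f"{g}: {mean:.1f}" for g, (_, mean) in groups.items())
    print(f"{name}: subgroup means ({by_group}); pooled mean {pooled:.1f}")

# School 2 outscores School 1 in BOTH subgroups (72 > 70 and 44 > 40),
# yet its pooled mean is lower (52.4 vs 67.0) because most of its
# students are in the lower-scoring subgroup B.
```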
Disaggregation: Damned if we do and damned if we don't
Don't report: we render certain groups invisible.
Do report: we blame the victim (they are the group that did not meet the standard).
Pearson’s Fourth Law of Assessment
Disaggregation is the right approach to
reporting results. Just be careful where the
accountability falls.
Pearson’s Fourth Law: A Corollary
Accountability, in general, falls to the lowest
level of reporting in the system.
Assessment can be the friend or the enemy of
teaching and learning
The curious case of DIBELS, … and other
benchmark assessments
The Dark Side
A word about benchmark
assessments…
 The world is filled with assessments that
provide useful information…
 But are not worth teaching to
 They are good thermometers or dipsticks
 Not good curriculum
The ultimate assessment dilemma…
 What do we do with all of these timed tests of
fine-grained skills:
 Words correct per minute
 Words recalled per minute
 Letter sounds named per minute
 Phonemes identified per minute
 Scott Paris: Constrained versus unconstrained
skills
 Pearson: Mastery skills versus growth constructs
Why they are so seductive
 Mirror at least some of the components of the
NRP report
 Correlate with lots of other assessments that have
the look and feel of real reading
 Take advantage of the well-documented finding that speed metrics are almost always correlated with ability, especially verbal ability.
 Example: alphabet knowledge. 90% of the kids might be 90% accurate, but they will still be normally distributed in terms of letter names per minute (LNPM).
How to get a high correlation between a mastered skill and something else
Contrast Letter Name Fluency (LNPM) with Letter Name Accuracy: the wider the distribution of scores, the greater the likelihood of obtaining a high correlation.
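A minimal simulation sketch, with invented parameters (not from the talk), makes the restriction-of-range point concrete: when most kids sit at the accuracy ceiling, accuracy's compressed distribution caps its correlation with anything, while the speeded LNPM score stays widely distributed and correlates strongly with ability.

```python
# Illustrative simulation with invented parameters: a mastered skill
# (letter-name accuracy, squeezed against a 100% ceiling) cannot correlate
# highly with anything, while the speeded version of the same skill
# (letter names per minute, LNPM) keeps a wide spread and tracks ability.
import random
import statistics

def pearson_r(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(1)
ability = [random.gauss(0.0, 1.0) for _ in range(2000)]
# Mastered skill: roughly 90% of simulated kids hit the 100% accuracy ceiling.
accuracy = [min(100.0, 102.0 + 0.5 * a + random.gauss(0.0, 1.5)) for a in ability]
# Speeded skill: widely distributed and related to ability.
lnpm = [40.0 + 10.0 * a + random.gauss(0.0, 5.0) for a in ability]

# Exact values vary with the seed, but accuracy's r comes out far lower
# than LNPM's because its score distribution is compressed at the ceiling.
print("r(accuracy, ability):", round(pearson_r(accuracy, ability), 2))
print("r(LNPM, ability):    ", round(pearson_r(lnpm, ability), 2))
```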
Face validity problem: What virtue is there in
doing things faster?
naming letters, sounds, words, ideas
What would you do differently if you knew
that Susie was faster than Ted at naming X, Y,
or Z???
Why I fear the use of these tests
They meet only one of the tests of validity:
criterion-related validity
correlate with other measures given at
the same time--concurrent validity
predict scores on other reading
assessments--predictive validity
Fail the test of curricular or face validity
They do not, on the face of it, look like
what we are teaching…especially the
speeded part
Unless, of course, we change
instruction to match the test
Really fail the test of consequential
validity
Weekly timed trials instruction
Confuses means and ends
Proxies don’t make good goals
The Achilles Heel: Consequential
Validity
Give DIBELS
Give Comprehension Test
Use results to craft instruction
Give DIBELS again
Give Comprehension Test
The emperor has no clothes
The bottom line on so many of these tests
Pearson's Third Law again
New Bumper Sticker
Never send a test out to do a curriculum's job!
The dark side of alignment: the transfer problem
I agree about the importance of curriculum-based assessment and situated learning, BUT…
We do expect what you learn in one context to assist you in others.
In our heart of hearts we do NOT believe that kids learn ONLY what you teach, OR that only what is tested is what should get learned (and taught).
Note our strong faith in the idea of application.
How do we test for transfer?
 A continuum of cognitive distance
 An example: Learn about the structure of
texts/knowledge about insect societies--bees, ants,
termites
 New passages
 Paper wasps
 A human society
 A biome
 How far will the learning travel?
 Our problem today: THIS IDEA OF TRANSFER IS NOT
EVEN ON OUR CURRENT RADAR SCREEN!!!
 And it ought to be!!!!!
Domain representation
 If we teach to the standards and the assessments,
will we guarantee that all important aspects of
the curriculum are covered?
 Linn and Shepard study: improvements on a narrow
assessment do not transfer to other assessments
 Shepard et al: in high stakes districts, high
performance on consequential assessments comes at a
price...
Linn and Shepard's work...
[Chart: scores on a new standardized test rise across years 1-5 while scores on the old standardized test do not.]
Shepard et al.'s work
ST = consequential standardized assessment; AA = more authentic assessment of the same skill domain.
Note the consequences of high stakes on alternative assessments.
[Chart: ST vs. AA performance in high-stakes schools and in low-stakes schools.]
Key Concept: Haladyna
Test Score Pollution: a rise or
fall in a score on a test without
an accompanying rise or fall in
the cognitive or affective
outcome allegedly measured by
the test
Aligning everything to the standards:
A model worth rejecting
Standards → Assessment → Instruction
• This model is likely to shape instruction too narrowly.
• It is likely to lead to test score pollution.
A better way of thinking about the link between standards, instruction, and assessment
Standards (how we operationalize our values about teaching and learning) guide the development of both instruction and assessment:
Standards → Teaching and Learning Activities
Standards → Assessment Activities
The logic of lots of good reform projects! This relationship can operate at the regional or local level.
Pearson’s Fifth Law of Assessment
Alignment is a double-edged sword. If there
must be alignment, lead with the instruction
and let the assessment follow.
Pearson’s Sixth Law
High Stakes will corrupt any assessment, no
matter how virtuous or pure in intent
Corollary to Pearson's Fifth and Sixth Laws
The worst possible combination is high stakes and low challenge.
So how did we do in responding to the challenges from Valencia & Pearson?
Issue (Grade): Solution
Prior knowledge (D): choice of passages
Authentic text (B+): things are lots better on lots of comprehension assessments
Inference (B): depends on the test
Diversity in knowledge means diversity in response (D): constructed response and multiple correct answers or graded answers
Flexible use of strategies (C): hard to assess; easy to coach; I'd abandon except for diagnostic interviews
Synthesizing information is paramount (D): still too much emphasis on details
So how did we do in responding to the challenges from Valencia & Pearson? (continued)
Asking questions as an index of comprehension (D): no progress except in informal classroom assessment
Measuring habits, attitudes, and dispositions (C): some reasonable things out there, but no teeth
Orchestrating many skills (D): too many mastery skills; not enough growth skills
Fluency (D): made a fetish out of it
Transfer and application (D): limited to a few situations
Overall grade (D): lots of work to do
Where should we be headed?
So, what makes sense for a district or school?
Develop an educational improvement system
Elements of an Educational
Improvement System
 Standards, yes
 Assessments, yes
 Outcome assessments for program evaluation
 Benchmark assessments for monitoring individual progress
 “Closer look” diagnostic assessments for determining
individual student emphases
 Reporting system, yes as long as we are prepared to
live with the dilemmas of disaggregation
 Alignment, but of a different sort
Outcome assessments
Drop in out of the sky
Curriculum independent
Assess reading in its most global aspects
Growth constructs NOT mastery constructs
Could be some sort of standardized assessment
A plan for early reading benchmark
assessments
Every so often, give four benchmark assessments.
Benchmarks for Intermediate and Secondary
Columns: Comprehend | Deconstruct (what do authors do and why) | Compose
Narratives: response to literature | author's craft | creative writing
Information genres: summaries, charts, key ideas | genre (form follows function) | writing from sources to convey ideas
Closer Look Assessments
There is no sin in examining the infrastructure
of reading
We really do need to know which of those pieces kids have and have not mastered
The question is what to do about them:
Teach to and practice the weak bits
Rely on strengths to bootstrap the weaknesses
Just read more “just right” material
The Teaching-to-Weaknesses Flaw
The Basic Skills Conspiracy of Good Intentions: first you gotta get the words right and the facts straight before you can do the what-ifs and I-wonder-whats.
Monitoring Conditions of Instruction
Collect data on curriculum and instructional practices
We need clear data on the enacted curriculum and instructional practices in order to link them as precisely as possible to achievement
Use data for program improvement
Design professional development
Return to the hard work on
assessment
 Encouraged by recent funding of new century
assessments
 Could be some good coming out of our reading for
understanding assessment grants in the US
 Possibilities in the Australian work: NAPLAN??
 Tests that take the high road (tests worth teaching to)
 Focus on making and monitoring meaning
 Focus on the role of reading in knowledge building and the
acquisition of disciplinary knowledge
 Focus on critical reasoning and problem solving
 Focus on representation of self.
The unfinished business from the 1990s
Where Could we Be Headed:
A Near Term Research Agenda
The Development of More Trustworthy, More
Useful Curriculum-Based Assessments
Expanding the logic of the Informal Reading
Inventory
Getting comprehension assessment right
Computerized Assessments (yes, but no time
today)
Expanding the logic of the IRI
 Benchmark books model à la Reading Recovery
 Indices of…
 Level of text one can read independently
 Accuracy (including error patterns)
 Fluency
 Comprehension
 Not one, not two, not three, but many, many
conceptually and psychometrically comparable
passages at every level of text challenge.
Comprehension Assessment
Our models for external assessment, modeled
after some of the better wide-scale
assessments, are OK.
We desperately need a school/classroom tool
that does for comprehension what running
records/benchmark books have done for oral
reading accuracy and fluency
Disciplinary Grounding
We're much better off if we ground our comprehension assessments in the inquiry and knowledge traditions of the disciplines rather than in generic, discipline-free tasks.
Pearson’s (bet on a) Seventh Law of
Assessment
Comprehension assessment begins and ends
within the knowledge traditions and inquiry
processes of each discipline
Pearson's (bet on a) Corollary to the Seventh Law
Summative (big external) assessments of
reading comprehension will be better if they
begin as formative (smaller internal)
assessments of reading comprehension within
the knowledge traditions and inquiry
processes of each discipline.
My bottom line
 Tests that are
 Instructionally sensitive
 Psychometrically sound
 Trustworthy
 No decision of consequence should be based upon a single indicator.
 Tests are a means to an end.
To reduce it to a single idea
Six, maybe seven laws
Two, maybe three corollaries
But only one thing truly worth remembering…
Never send a test out to do a curriculum's job!
Coda in Stuart McNaughton’s Spirit
A new bumper sticker with a tinge of
optimism.
Tests in support of teaching and
learning.
Computerized Assessment
 With advances in voice recognition, we are close to
being able to teach computers to recognize and score
students’ oral responses
 Applications:
 Listen to oral reading of benchmark passages and conduct
a first level diagnosis (thus eliminating a key barrier, time,
to more widespread use of this important diagnostic tool).
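As a rough sketch of what such a first-level diagnosis might look like, the Python below aligns a speech-recognizer transcript of a child's oral reading against the benchmark passage and reports accuracy and words correct per minute (WCPM). This is an invented illustration, not BARLA or any real tool; the function name, sample passage, and numbers are all hypothetical.

```python
# Hypothetical first-level oral reading diagnosis: align the recognized
# words against the benchmark passage, then compute accuracy and WCPM.
import difflib

def oral_reading_diagnosis(passage: str, transcript: str, seconds: float):
    target = passage.lower().split()
    read = transcript.lower().split()
    matcher = difflib.SequenceMatcher(a=target, b=read)
    # Words read exactly as printed, in order.
    correct = sum(block.size for block in matcher.get_matching_blocks())
    accuracy = correct / len(target)
    wcpm = correct / (seconds / 60)
    # Passage words that never matched are candidate error patterns
    # for a teacher to inspect more closely.
    matched = set()
    for block in matcher.get_matching_blocks():
        matched.update(range(block.a, block.a + block.size))
    errors = [w for i, w in enumerate(target) if i not in matched]
    return {"accuracy": round(accuracy, 2), "wcpm": round(wcpm, 1), "errors": errors}

# Example: two miscues ("find" for "found", "eat" for "wheat") in 12 seconds.
print(oral_reading_diagnosis(
    passage="the little red hen found a grain of wheat",
    transcript="the little red hen find a grain of eat",
    seconds=12.0,
))
```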
Computerized Assessment in Early
Literacy
 More applications of voice recognition
 Phonemic awareness tasks
 Word reading tasks
 Phonics tests (both real words and synthetic words)
 Comprehension assessment
• still a way down the road because of the interpretive problem
• The computer has to both listen to and understand the
response
• BARLA: Bay Area Reading and Listening Assessment