The Tortured History of Reading
Comprehension Assessment
P. David Pearson
UC Berkeley
Professor and Former Dean
Slides available at www.scienceandliteracy.org
The guiding questions…
Are there lessons from the past?
Is there hope for the future?
Will we ever get it right?
My predictions
Are there lessons from the past? MAYBE … YES
Is there hope for the future? MAYBE … MAYBE
Will we ever get it right? MAYBE … NO
Why me?
Have lived the tensions in comprehension assessment, but
An unwitting and unwilling candidate for the task?
Comprehension assessment was on the way to more important goals.
Mea Culpa
Those whose work I omit
Those whose work I DON’T omit
Why now?
Post-NCLB uneasiness among practitioners that the code,
as important as it is, may not be the point of reading
New Rhetorical Moves
Deeper Learning
21st Century Skills
Reading-Writing Connections
Common Core Standards
• Read like a detective; write like a reporter
Common Core Standards: The substance
New R&D: Reading for Understanding
New psychometric tools
Why now? More…
National thirst for accountability requires impeccable measures (both conceptually and psychometrically)
When the stakes are high, so too must be the standards
Pleas of teachers desperate for useful tools
Comprehension analogue of running records of oral reading as indices of fluency and accuracy
The real need…
Theoretically elegant measures, yes
Even more need for an everyday monitoring tool
LESSONS FROM THE PAST
There are many
A few of my favorites
Lesson #1: No matter how hard we
try, we never “see” reading
comprehension.
If we never see the click of comprehension…
What criterion do we adopt as our gold standard for determining the validity of the various markings we have to live with?
Cognitive interview
• Illinois Goal Assessment Program
• NAEP and most other assessments
The immediate apprehension of understanding
• The sense of getting it!
– Self ratings and think alouds
Criterion variables worth predicting
Lesson #2: No matter how novel you think
your idea is, you can find a precedent for it
that is at least a century old
I know! Let’s use open-ended performance items on our high-stakes state test!
Tidbits from the 1919 Wisconsin State
High School Reading Exam
Write a few memory gems from your
literature experiences in school.
Name three novels you have read this
year and give the plot of each.
What is the significance of the “letter” in Hawthorne’s The Scarlet Letter?
Define these terms: poetic license,
flashback, and simile.
So as long as we are forced to use MC formats, let’s have the students pick more than one right answer!
A curious example from early 1900s
“Every one of us, whatever our speculative
opinion, knows better than he practices, and
recognizes a better law than he obeys.”
Check two of the following statements with the
same meaning as the quotation above.
To know right is to do the right.
Our speculative opinions determine our
actions.
Our deeds often fall short of the actions we
approve.
Our ideas are in advance of our everyday
behavior.
From Thurstone, undated, circa 1910
So what if we use an error
detection paradigm?
Chapman 1924
Find the statements in Part 2 of the paragraph
that don’t fit the statements in Part 1…
I know! Let’s do think alouds to get at what students are really doing while they answer the questions!
Touton and Berry (1931) Error
analyses
(a) failure to understand the question
(b) failure to isolate elements of “an involved
statement” read in context
(c) failure to associate related elements in a context
(d) failure to grasp and retain ideas essential to
understanding concepts
(e) failure to see setting of the context as a whole
(f) other irrelevant answers
Lesson #3: Grain size really matters
in reading assessment, even within
comprehension assessment
The Scene in the US in the 1970s and
early 1980s
Behavioral objectives
Mastery Learning
Criterion referenced assessments
Curriculum-embedded assessments
Minimal competency tests: New Jersey
Statewide assessments: Michigan &
Minnesota
Historical relationships between instruction and assessment
Skill 1: Teach → Assess → Conclude
Skill 2: Teach → Assess → Conclude
The 1970s skills-management mentality: Teach a skill, assess it for mastery, reteach it if necessary, and then go on to the next skill.
The 1970s, cont.
Foundation: Benjamin Bloom’s ideas of mastery learning
Skill 1: Teach → Assess → Conclude
Skill 2: Teach → Assess → Conclude
Skill 3: Teach → Assess → Conclude
Skill 4: Teach → Assess → Conclude
Skill 5: Teach → Assess → Conclude
Skill 6: Teach → Assess → Conclude
And we taught each of these skills until we had covered the entire curriculum for a grade level.
Sue’s grandmother lives on a farm. Ellen’s grandmother
lives in the city. Sue’s grandmother, who just turned 55,
phones Sue every month. Ellen’s grandmother, who is
also 55, sends Ellen e-mails several times a week. Both
grandmothers love their granddaughters.
How are Sue and Ellen’s grandmothers alike?
They both love their granddaughters
They both use e-mail
They both live on a farm
How are they different?
They live in different places
They have different color hair
They are different ages
Fast Forward to 2002
These specific skill tests have not gone away
Today’s standards are yesterday’s objectives or skills
The 2000s
Skill/standard 1: Teach → Assess → Conclude
Skill/standard 2: Teach → Assess → Conclude
Skill/standard 3: Teach → Assess → Conclude
Skill/standard 4: Teach → Assess → Conclude
Skill/standard 5: Teach → Assess → Conclude
Skill/standard 6: Teach → Assess → Conclude
And we taught each of these standards until we had covered the entire curriculum for a grade level.
A word about benchmark
assessments…
The world is filled with assessments that
provide useful information…
But are not worth teaching to
They are good thermometers or dipsticks
Not good curriculum
The ultimate assessment dilemma…
What do we do with all of these timed tests of
fine-grained skills:
Words correct per minute
Words recalled per minute
Letter sounds named per minute
Phonemes identified per minute
Scott Paris: Constrained versus unconstrained
skills
Pearson: Mastery constructs versus growth
constructs
Why they are so seductive
Mirror at least some of the components of the
NRP report
Correlate with lots of other assessments that have
the look and feel of real reading
Take advantage of the well-documented finding that speed metrics are almost always correlated with ability, especially verbal ability.
Example: alphabet knowledge
90% of the kids might be 90% accurate but…
They will be normally distributed in terms of LNPM (letter names per minute)
How to get a high correlation between a mastered skill and something else
[Chart contrasting Letter Name Fluency (LNPM) with Letter Name Accuracy]
The wider the distribution of scores, the greater the likelihood of obtaining a high correlation with a criterion
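The restriction-of-range point above can be illustrated with a small simulation (a sketch with invented numbers, not real data): a mastered skill sits at the ceiling with little variance, so it cannot correlate strongly with a criterion, while a speeded version of the same skill spreads students out.

```python
import random

random.seed(42)

n = 1000
ability = [random.gauss(0, 1) for _ in range(n)]

# Letter-name ACCURACY: a mastered skill, so most kids are at or near
# the ceiling of 1.0 (restricted range).
accuracy = [min(1.0, 1.02 + 0.05 * a + random.gauss(0, 0.02)) for a in ability]

# Letter-name FLUENCY (letter names per minute): normally distributed,
# with wide spread driven by the same underlying ability.
fluency = [40 + 10 * a + random.gauss(0, 5) for a in ability]

# Criterion: another reading measure driven by the same ability.
criterion = [a + random.gauss(0, 0.5) for a in ability]

def pearson_r(xs, ys):
    """Pearson product-moment correlation."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r_accuracy = pearson_r(accuracy, criterion)
r_fluency = pearson_r(fluency, criterion)
print(f"accuracy vs. criterion: r = {r_accuracy:.2f}")
print(f"fluency  vs. criterion: r = {r_fluency:.2f}")
```

The fluency measure wins the correlation contest not because it captures more of the construct, but because its scores have room to vary.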
Face validity problem: What virtue is there in
doing things faster?
naming letters, sounds, words, ideas
What would you do differently if you knew
that Susie was faster than Ted at naming X, Y,
or Z???
Why I fear the use of these tests
They meet only one of the tests of validity:
criterion-related validity
correlate with other measures given at
the same time--concurrent validity
predict scores on other reading
assessments--predictive validity
Fail the test of curricular or face validity
They do not, on the face of it, look like
what we are teaching…especially the
speeded part
Unless, of course, we change
instruction to match the test
Really fail the test of consequential
validity
Weekly timed trials instruction
Confuses means and ends
Proxies don’t make good goals
The Achilles Heel: Consequential
Validity
Give DIBELS
Give Comprehension Test
Use results to craft instruction
Give DIBELS again
Give Comprehension Test
The emperor has no clothes
Collateral Damage
Tight link between instruction and assessment
Assess at a low level of challenge
Basic Skills Conspiracy
First you have to get all the words right and all
the facts straight before you can do the what
ifs and I wonder whats.
The bottom line on so many of these tests
Never send a test out to do a curriculum’s job!
Lesson #4: It is very difficult to
oust the incumbent
• Two mini-case studies
• Unconventional state
assessments
• Performance assessments
Valencia and Pearson (1987). Reading Assessment: Time for a Change. The Reading Teacher.
A set of contrasts between cognitively oriented views of
reading and prevailing practices in assessing reading circa
1986
New views of the reading process tell us that . . . / Yet when we assess reading comprehension, we . . .
New view: Prior knowledge is an important determinant of reading comprehension.
Yet we: Mask any relationship between prior knowledge and reading comprehension by using lots of short passages on lots of topics.
New view: A complete story or text has structural and topical integrity.
Yet we: Use short texts that seldom approximate the structural and topical integrity of an authentic text.
New view: Inference is an essential part of the process of comprehending units as small as sentences.
Yet we: Rely on literal comprehension test items.
New view: The diversity in prior knowledge across individuals as well as the varied causal relations in human experiences invites many possible inferences to fit a text or question.
Yet we: Use multiple-choice items with only one correct answer, even when many of the responses might, under certain conditions, be plausible.
New view: The ability to vary reading strategies to fit the text and the situation is one hallmark of an expert reader.
Yet we: Seldom assess how and when students vary the strategies they use during normal reading, studying, or when the going gets tough.
New view: The ability to synthesize information from various parts of the text and different texts is a hallmark of an expert reader.
Yet we: Rarely go beyond finding the main idea of a paragraph or passage.
New view: The ability to ask good questions of text, as well as to answer them, is a hallmark of an expert reader.
Yet we: Seldom ask students to create or select questions about a selection they may have just read.
New view: All aspects of a reader’s experience, including habits that arise from school and home, influence reading comprehension.
Yet we: Rarely view information on reading habits and attitudes as being as important as information about performance.
New view: Reading involves the orchestration of many skills that complement one another in a variety of ways.
Yet we: Use tests that fragment reading into isolated skills and report performance on each.
New view: Skilled readers are fluent; their word identification is sufficiently automatic to allow most cognitive resources to be used for comprehension.
Yet we: Rarely consider fluency as an index of skilled reading.
New view: Learning from text involves the restructuring, application, and flexible use of knowledge in new situations.
Yet we: Often ask readers to respond to the text’s declarative knowledge rather than to apply it to near and far transfer tasks.
Why did we take this stance?
Need a little mini-history of assessment to understand our motives
The Scene in the US in the 1970s and
early 1980s
Behavioral objectives
Mastery Learning
Norm-referenced assessments
Criterion referenced assessments
Curriculum-embedded assessments
Minimal competency tests: New Jersey
Statewide assessments: Michigan &
Minnesota
Cognitive perspectives claim that we
had
Paid too much attention to measurement
theory and
Not enough to reading theory
Authentic Texts
Select, not construct, texts for understanding
Can’t tinker with the text to rationalize items
and distractors
More than one right answer
How does Ronnie reveal his interest in Anne?
Ronnie cannot decide whether to join in the conversation.
Ronnie gives Anne his treasure, the green ribbon.
Ronnie gives Anne his soda.
Ronnie invites Anne to play baseball.
During the game, he catches a glimpse of the green ribbon
in her hand.
Rate all of the responses on some
scale of relevance
How does Ronnie reveal his interest in Anne?
(2)(1)(0) Ronnie cannot decide whether to join in the
conversation.
(2)(1)(0) Ronnie gives Anne his treasure, the green ribbon.
(2)(1)(0) Ronnie gives Anne his soda.
(2)(1)(0) Ronnie invites Anne to play baseball.
(2)(1)(0) During the game, he catches a glimpse of the
green ribbon in her hand.
Best predictor of retelling scores
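The rate-every-response format above can be scored with a simple rule. A hypothetical sketch (option labels shortened from the Ronnie item, and the max-minus-disagreement scoring rule is invented for illustration):

```python
# Relevance key for each response option: 2 = highly relevant,
# 1 = partly relevant, 0 = irrelevant.
key = {"join conversation": 0, "green ribbon": 2, "soda": 1,
       "baseball": 1, "glimpse of ribbon": 2}

# One student's ratings of the same options.
student = {"join conversation": 0, "green ribbon": 2, "soda": 0,
           "baseball": 1, "glimpse of ribbon": 1}

# Score = maximum possible minus total absolute disagreement with the key,
# so every option contributes information, not just one "right answer".
max_score = 2 * len(key)
score = max_score - sum(abs(key[o] - student[o]) for o in key)
print(score, "/", max_score)  # → 8 / 10
```

Because every option is scored, each item yields several data points instead of one, which is what makes the format attractive for information and reliability.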
Include
Prior knowledge
Metacognition
Habits, attitudes, and dispositions
Initiatives
IGAP in Illinois
MEAP in Michigan
Alternative assessment systems in lots of
states
Kentucky
Vermont
Maryland
Washington
CLAS
Some findings from the ill-fated IGAP
Comprehension plus PK, Metacognition,
Habits/Attitude
Factor analyses (Pearson et al., 1991) demonstrated three reliably separable factors
Metacognitive stances
habits/attitudes items
a combination of the comprehension and prior
knowledge items (could not separate them)
Fate
Went the way of all tests that challenge the
conventional wisdom
More than one right answer was NOT the right
answer
Intentionally validated for group decisions not
individual (as accountability changed…)
Parts of it were corruptible:
Not good to teach to (e.g. metacognitive items)
Lost opportunities
Selecting or rating answers really is a good idea
Maximizes information and reliability
Legacies
Authentic passages
Text analyses to determine importance
Item design process
The infusion of reading theory to stand alongside measurement theory
The Golden Years of the 90s?
A flying start in the late 1980s and early 1990s
International activity in Europe, Down Under,
North America
Developmental Rubrics
Performance Tasks
Portfolios of Various Sorts
Increase the use of constructed response items in
NRTs (including NAEP)
Late 1980s/early 1990s:
Portfolios
Performance Assessments
Make Assessment Look Like Instruction
Activities → from which we draw → Conclusions on standards 1–n
We engage in instructional activities, from which we collect evidence which permits us to draw conclusions about student growth or accomplishment on several dimensions (standards) of interest.
The complexity of performance assessment practices: one to many
Activity X → Standards 1, 2, 3, 4, 5
Any given activity may offer evidence for many standards, e.g., responding to a story.
The complexity of performance assessment practices: many to one
Activities 1, 2, 3, 4, 5 → Standard X
For any given standard, there are many activities from which we could gather relevant evidence about growth and accomplishment, e.g., reads fluently.
The complexity of performance assessment practices: many to many
Activities 1–5 ↔ Standards 1–5
• Any given artifact/activity can provide evidence for many standards
• Any given standard can be indexed by many different artifacts/activities
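The many-to-many structure above can be sketched as a pair of mappings (all activity and standard names are invented for illustration): each activity lists the standards it evidences, and inverting the map recovers, for each standard, the activities that document it.

```python
# Each classroom activity -> the standards it can supply evidence for.
evidence_for = {
    "respond to a story":    ["comprehension", "writing", "response to literature"],
    "oral reading record":   ["fluency", "word identification"],
    "research report":       ["comprehension", "writing", "use of sources"],
    "literature discussion": ["comprehension", "response to literature"],
}

# Invert the map: each standard -> the activities that index it.
activities_for = {}
for activity, standards in evidence_for.items():
    for standard in standards:
        activities_for.setdefault(standard, []).append(activity)

print(activities_for["comprehension"])
# several distinct activities index the same standard
```

The inversion makes the bookkeeping burden of performance assessment concrete: every artifact must be cross-indexed against every standard it might speak to.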
Representation of Self
Assessment can be something that someone or something else does to you.
Assessment can be something that you (learn to) do for yourself.
Post 1996: The Demise of Performance Assessment
A definite retreat from performance-based assessment as a wide-scale tool
Psychometric issues
Cost issues
Labor issues
Political issues
The Remains…
Still alive inside classrooms and schools
Hybrid assessments based on the NAEP
and other assessment models
multiple-choice
short answer
extended response
Is there hope for the future?
The stars may be aligned
Consensus on Comprehension
CCSS
PARCC/SBAC
Renaissance brewing?
Automated scoring
Integration
of the first order—with reading and writing and
language
of the second order—with the disciplines
Consensus on comprehension
Something like a “Kintschian” view that readers build successive models of meaning
Text base: What the text says
Situation model: What the text means
Kintschian Model
Text → (1) Locate and Recall → Text Base: what the text says (Reader as Decoder)
(2) Integrate and Interpret → Situation Model: what the text means (Reader as Meaning Maker)
(3) Knowledge Base / Experience: inside the head and out in the world
RAND REPORT
“The process of simultaneously extracting and constructing meaning through interaction and involvement with written language. We use the words extracting and constructing to emphasize both the importance and the insufficiency of the text as a determinant of reading comprehension.”
10 recurring standards for College and Career Readiness
Show up grade after grade
In more complex applications to more sophisticated texts
Across the disciplines of literature, science, and social studies
Affordances of the CCSS
1. An uplifting vision based on our best research on the
nature of reading comprehension
2. Focus on results rather than means
3. Integrated model of literacy
4. Reading standards are consistent with cognitive
theory
5. Elaborated theory of text complexity
6. Shared responsibility (text in subject matter learning)
for promoting literacy
7. Lots of meaty material in writing and language
standards
1. An Uplifting Vision: ELA CCSS
Students who meet the Standards readily undertake the close, attentive reading that is at the heart of understanding and enjoying complex works of literature.
They habitually perform the critical reading necessary to pick
carefully through the staggering amount of information available
today in print and digitally.
They actively seek the wide, deep, and thoughtful engagement with
high-quality literary and informational texts that builds knowledge,
enlarges experience, and broadens world views.
They reflexively demonstrate the cogent reasoning and use of
evidence essential to both private deliberation and responsible
citizenship in a democratic republic.
PARCC and SBAC: The National Consortia
A Performance Renaissance?
Some compelling trends on the horizon
CBAL and GISA from ETS (Tuesday, 8:15, Mardi Gras Ballroom FGH, Marriott)
ORCA from UConn (Tuesday, 2:15, Mardi Gras Ballroom FGH, Marriott)
SCALE–New York City
• GISA: Global and Integrated Scenario-Based Assessment
• CBAL: Cognitively-Based Assessment of, for and as Learning
• ORCA: Online Reading Comprehension Assessment
• SCALE: Stanford Center for Assessment, Learning and Equity
CBAL (ETS)
• Elegant conceptual organization: semiotic space, literate practices
• Lots of scaffolding across tasks
• Categorizing evidence for arguments
• MC version of scaffolding
• Student accepts more responsibility
• Integration of reading, writing, rhetoric, and discourse
• Lead-in tasks scaffold the writing and assess reading
• Critique on the way to production
SCALE–New York City
• Challenging
• Highly scaffolded
• Explicit criteria, openly shared
• Multiple primary sources
• Integration of reading, writing, disciplinary knowledge, rhetoric, and discourse (What do you think? What makes you think so?)
• Thoughtful learning progressions
• Simple probes can lead to deep learning
Simple but deep comprehension questions:
What does this document tell you about the causes
of the Spanish-American War?
What evidence supports your answer?
New Psychometric Tools
More elaborate design frameworks, integrated with cognitive accounts of the phenomena being tested
• Mislevy et al.’s Evidence-Centered Design
• Wilson et al.’s BEAR system
Advances in scaling
Automated scoring of essays and constructed responses
• From unbridled skepticism to hopeless enthusiasm
Two faces of integration
Reading-writing-language
Disciplinary grounding
So is there hope for the future?
Maybe+ (on a scale of No … Maybe … Yes)
Will we ever get it right?
What is missing
Learning progressions
Generalizability agenda
Disciplinary grounding
Beyond cognition to critical thinking and
critical literacy
Context
Text Task Scenarios
Built on the idea that reading really is determined by a reader interacting with a text (of some sort) by completing a task for a purpose in a context.
How can we model the variability that
the context brings to comprehension?
Level playing field
Standardize the stimuli and conditions of
assessment
Change the question from
How do folks stack up when I have maximized the
similarity of conditions?
To
Under what conditions can a given student succeed or
fail at a given task?
• Hearkens back to Feuerstein’s dynamic assessment notions
Knowledge
Interest
Purpose
Discipline
Text Complexity
Task Characteristics
Degrees of Scaffolding
Social supports
Another vision of computer adaptive
testing
Report student profiles as a function of the
levels of each relevant variable that put
him/her at different points along the scale.
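The profile idea above can be sketched as a small data structure (a hypothetical illustration; all variable names, levels, and success rates are invented): instead of one score under standardized conditions, report how a student performs as each contextual variable is varied.

```python
# Contextual variables and the levels at which the student was assessed.
conditions = {
    "knowledge":   ["low", "high"],
    "scaffolding": ["none", "heavy"],
    "complexity":  ["grade-level", "stretch"],
}

# Pretend observed success rates for one student under each condition level.
observed = {
    ("knowledge", "low"): 0.45,    ("knowledge", "high"): 0.85,
    ("scaffolding", "none"): 0.50, ("scaffolding", "heavy"): 0.80,
    ("complexity", "grade-level"): 0.75, ("complexity", "stretch"): 0.40,
}

# Assemble the per-variable profile the slide envisions reporting.
profile = {
    var: {level: observed[(var, level)] for level in levels}
    for var, levels in conditions.items()
}
for var, levels in profile.items():
    print(var, levels)
```

Such a profile answers "under what conditions can this student succeed?" rather than "how does this student rank under one standardized condition?"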
What would this approach mean for
task generalizability?
Seeking differentiation not generalization
Different underlying construct of “leveling the
playing field”
Not maximizing the standardization of relevant factors
across persons and occasions
Maximizing the optimization of relevant factors within
persons
• How do YOU perform when the stars are optimally aligned in your favor?
• How do YOU perform when one or more of those stars is/are less optimally aligned?
Return to the hard work on
assessment
Encouraged by recent funding of new century
assessments
Encouraged by reading for understanding assessment
grants in the US
Encouraged by grass roots efforts
Tests that take the high road
Focus on making and monitoring meaning
Focus on the role of reading in knowledge building and the
acquisition of disciplinary knowledge
Focus on critical reasoning and problem solving
Focus on representation of self.
The unfinished business from the 1990s
Will we ever get it right?
No … Maybe … Yes
(High stakes and low challenge)