Reading Assessment: Still Time for a Change. P. David Pearson, UC Berkeley Professor and Former Dean. Slides available at www.scienceandliteracy.org.
Why did I pick such a boring topic? I'm a professor! Who needs fun? The consequences are too grave. I have a perverse standard of fun.

Valencia and Pearson (1987). Reading Assessment: Time for a Change. The Reading Teacher. A set of contrasts between cognitively oriented views of reading and prevailing practices in assessing reading circa 1986.

New views of the reading process tell us that... / Yet when we assess reading comprehension, we...

1. Prior knowledge is an important determinant of reading comprehension. / Mask any relationship between prior knowledge and reading comprehension by using lots of short passages on lots of topics.
2. A complete story or text has structural and topical integrity. / Use short texts that seldom approximate the structural and topical integrity of an authentic text.
3. Inference is an essential part of the process of comprehending units as small as sentences. / Rely on literal comprehension test items.
4. The diversity in prior knowledge across individuals, as well as the varied causal relations in human experience, invites many possible inferences to fit a text or question. / Use multiple-choice items with only one correct answer, even when many of the responses might, under certain conditions, be plausible.
5. The ability to vary reading strategies to fit the text and the situation is one hallmark of an expert reader. / Seldom assess how and when students vary the strategies they use during normal reading, studying, or when the going gets tough.
6. The ability to synthesize information from various parts of the text and from different texts is a hallmark of an expert reader. / Rarely go beyond finding the main idea of a paragraph or passage.
7. The ability to ask good questions of text, as well as to answer them, is a hallmark of an expert reader. / Seldom ask students to create or select questions about a selection they may have just read.
8. All aspects of a reader's experience, including habits that arise from school and home, influence reading comprehension. / Rarely view information on reading habits and attitudes as being as important as information about performance.
9. Reading involves the orchestration of many skills that complement one another in a variety of ways. / Use tests that fragment reading into isolated skills and report performance on each.
10. Skilled readers are fluent; their word identification is sufficiently automatic to allow most cognitive resources to be used for comprehension. / Rarely consider fluency as an index of skilled reading.
11. Learning from text involves the restructuring, application, and flexible use of knowledge in new situations. / Often ask readers to respond to the text's declarative knowledge rather than to apply it to near and far transfer tasks.

What is thinking? "You do it in your head, without a pencil." (Alexandra, age 4) "You shouldn't do it in the dark. It's too scary." (Thomas, age 5) "Thinking is when you're doing math and getting the answers right." (Sissy, age 5) And in response: "NO! You do the thinking when you DON'T know the answer." (Alex, age 5) "It's very, very easy. The way you do it is just close your eyes and look inside your head." (Robert, age 4) "You think before you cross the street!" What do you think about? "You think about what you would look like smashed up!" (Leon, age 5) "You have to think in swimming class." About what? "About don't drink the water because maybe someone peed in it... and don't drown!"

Why did We Take this Stance? We need a little mini-history of assessment to understand our motives.

The Scene in the US in the 1970s and early 1980s: behavioral objectives; mastery learning; criterion-referenced assessments; curriculum-embedded assessments; minimal competency tests (New Jersey); statewide assessments (Michigan and Minnesota).

Historical relationships between instruction and assessment. The 1970s skills management mentality: teach a skill, assess it for mastery, reteach it if necessary, and then go on to the next skill (Skill 1: Teach, Assess, Conclude; Skill 2: Teach, Assess, Conclude; and so on). The foundation was Benjamin Bloom's ideas of mastery learning, and we taught each of these skills until we had covered the entire curriculum for a grade level.

Dangers in the Mismatch we Saw in 1987: a false sense of security; instructional insensitivity to progress on new curricula; accountability will do us in and force us to teach to the tests and all the bits and pieces.

Pearson's First Law of Assessment: The finer the grain size at which we monitor a process like reading and writing, the greater the likelihood that we will end up teaching and testing bits and pieces rather than global processes like comprehension and composition.
The ideal: in the best possible assessment, teachers observe and interact with students as they read authentic texts for genuine purposes; they evaluate the way in which the students construct meaning; and they intervene to provide support or suggestions when the students appear to have difficulty.

Pearson's Second Law of Assessment: An assessment tool is valued to the degree that it can approximate the good judgment of a professional teacher!

A new conceptualization of the goal: [Table: assessment features (accuracy, fluency, word meaning, comprehension, critique, response) crossed with levels of decision-making (beyond school, school, classroom, individual); the cells name tools such as norm-referenced tests (NRTs), informal reading inventories (IRIs), unit tests and activities, performance tasks, essays, and discussion.]

A 1987 Agenda for the Future. Pearson's Third Law of Assessment: When we ask an assessment to serve a purpose for which it was not designed, it is likely to crumble under the pressure, leading to invalid decisions and detrimental consequences.

Early 1990s in the USA: standards-based reform; state initiatives; the IASA model. Trading flexibility for accountability: a move from being accountable for the means and leaving the ends up for grabs (the doctor or lawyer model) to being accountable for the ends and leaving the means up for grabs (the carpenter or product model).

Mid-1990s developments: assessment got situated within the standards movement. Content standards: what should students know and be able to do? Performance standards: what counts? Opportunity-to-learn standards: quid pro quo?

Standards-based reform, the initial theory of action: Standards, Assessment, and Accountability yield Clear Expectations and Motivation, which yield Higher Student Learning (a la Tucker and Resnick in the early 1990s). The expanded theory of action: Standards, Assessment, and Accountability yield Clear Expectations, Instruction, Motivation, and Professional Development, which yield Higher Student Learning (a la Elmore and Resnick in the late 1990s).

The Golden Years of the 90s?
A flying start in the late 1980s and early 1990s: international activity in Europe, Down Under, and North America. Developmental rubrics. Performance tasks (New Standards; CLAS). Portfolios of various sorts: storage bins; showcase (best work); compliance (Walden, NYC). Increased use of constructed-response items in NRTs.

Late 1980s/early 1990s: portfolios and performance assessments make assessment look like instruction. We engage in instructional activities, from which we collect evidence that permits us to draw conclusions about student growth or accomplishment on several dimensions (standards) of interest.

The complexity of modern assessment practices, one to many: any given activity (e.g., responding to a story) may offer evidence for many standards.

The complexity of performance assessment practices, many to one: for any given standard (e.g., reads fluently), there are many activities from which we could gather relevant evidence about growth and accomplishment.

The complexity of portfolio assessment practices, many to many: any given artifact or activity can provide evidence for many standards, and any given standard can be indexed by many different artifacts or activities.

The perils of performance assessment: or maybe those multiple-choice assessments aren't so bad after all...
"Thunder is a rich source of loudness."
"Nitrogen is not found in Ireland because it is not found in a free state."
"Water is composed of two gins, Oxygin and Hydrogin. Oxygin is pure gin. Hydrogin is gin and water."
"The tides are a fight between the Earth and moon. All water tends towards the moon, because there is no water in the moon, and nature abhors a vacuum. I forget where the sun joins in this fight."
"Germinate: To become a naturalized German."
"Vacumm: A large, empty space where the pope lives."
"Momentum is something you give a person when they go away."
"The cause of perfume disappearing is evaporation. Evaporation gets blamed for a lot of things people forget to put the top on."
"Mushrooms always grow in damp places which is why they look like umbrellas."
"Genetics explains why you look like your father, and if you don't, why you should."
"When you breath, you inspire. When you do not breath, you expire."

Post-1996: The Demise of Performance Assessment. A definite retreat from performance-based assessment as a wide-scale tool: psychometric issues; cost issues; labor issues; political issues.

The Remains: still alive inside classrooms and schools. Hybrid assessments based on the NAEP model: multiple choice, short answer, extended response. The persistence of standards-based reform.

No Child Left Behind: accountability in spades. Every grade level reporting. Census assessment rather than sampling (everybody takes the same test). Disaggregated reporting by income, exceptionality, language, and ethnicity. NCLB, continued: assessments for varied purposes (placement, progress monitoring, diagnosis, outcomes/program evaluation), and a scientifically based curriculum too.

There is good reason to worry about disaggregation. [Figure: achievement bars for School 1 and School 2; bar height = average achievement, bar width = number of students. Disaggregation and masking: each school serves a large-N and a small-N subgroup (A and B), and the overall averages can reverse the subgroup comparisons. Simpson's Paradox?]

Disaggregation: damned if we do and damned if we don't. Don't report: render certain groups invisible. Do report: blame the victim (they are the group that did not meet the standard).

Pearson's Fourth Law of Assessment: Disaggregation is the right approach to reporting results.
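The masking worry above (Simpson's paradox) can be sketched with a toy computation. All numbers here are hypothetical, invented purely for illustration; neither school is real:

```python
# Toy Simpson's-paradox illustration (hypothetical numbers, not from the talk).
# School 2 scores HIGHER in every subgroup, yet School 1 has the higher
# overall average, because School 1's enrollment is weighted toward the
# higher-scoring group A.

school1 = {"A": (80, 90), "B": (60, 10)}  # group -> (mean score, n students)
school2 = {"A": (85, 10), "B": (65, 90)}

def overall(school):
    """Enrollment-weighted average across subgroups."""
    total = sum(mean * n for mean, n in school.values())
    n = sum(n for _, n in school.values())
    return total / n

print(overall(school1))  # 78.0 -- School 1 "wins" the aggregate comparison
print(overall(school2))  # 67.0 -- despite winning within every subgroup
```

The aggregate comparison and the subgroup comparisons point in opposite directions, which is exactly why reporting only one of them can mislead.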
Just be careful where the accountability falls. A corollary to Pearson's Fourth Law: Accountability, in general, falls to the lowest level of reporting in the system.

Assessment can be the friend or the enemy of teaching and learning: the curious case of DIBELS, and other benchmark assessments. The dark side: a word about benchmark assessments. The world is filled with assessments that provide useful information but are not worth teaching to; they are good thermometers or dipsticks, not good curriculum.

The ultimate assessment dilemma: what do we do with all of these timed tests of fine-grained skills (words correct per minute, words recalled per minute, letter sounds named per minute, phonemes identified per minute)? Scott Paris calls these constrained versus unconstrained skills; Pearson, mastery skills versus growth constructs.

Why they are so seductive: they mirror at least some of the components of the NRP report; they correlate with lots of other assessments that have the look and feel of real reading; and they take advantage of the well-documented finding that speed metrics are almost always correlated with ability, especially verbal ability.

Example: alphabet knowledge. 90% of the kids might be 90% accurate, but they will be normally distributed in terms of letter names per minute (LNPM). That is how to get a high correlation between a mastered skill and something else: use letter name fluency (LNPM) rather than letter name accuracy, because the wider the distribution of scores, the greater the likelihood of obtaining a high correlation. But there is a face validity problem: what virtue is there in doing things faster, whether naming letters, sounds, words, or ideas? What would you do differently if you knew that Susie was faster than Ted at naming X, Y, or Z?
Why I fear the use of these tests. They meet only one of the tests of validity, criterion-related validity: they correlate with other measures given at the same time (concurrent validity) and predict scores on other reading assessments (predictive validity). They fail the test of curricular or face validity: they do not, on the face of it, look like what we are teaching, especially the speeded part, unless, of course, we change instruction to match the test. And they really fail the test of consequential validity: weekly timed trials become the instruction, which confuses means and ends. Proxies don't make good goals.

The Achilles heel, consequential validity: Give DIBELS. Give a comprehension test. Use the results to craft instruction. Give DIBELS again. Give the comprehension test again. The emperor has no clothes.

The bottom line on so many of these tests: Pearson's Third Law again. New bumper sticker: Never send a test out to do a curriculum's job!

The dark side of alignment: the transfer problem. I agree about the importance of curriculum-based assessment and situated learning, BUT we do expect what you learn in one context to assist you in others. In our heart of hearts we do NOT believe that kids learn ONLY what you teach, or that only what is tested is what should get learned (and taught). Note our strong faith in the idea of application.

How do we test for transfer? A continuum of cognitive distance. An example: learn about the structure of texts and knowledge about insect societies (bees, ants, termites), then try new passages: paper wasps, a human society, a biome. How far will the learning travel? Our problem today: THIS IDEA OF TRANSFER IS NOT EVEN ON OUR CURRENT RADAR SCREEN!!! And it ought to be!!!!!

Domain representation: if we teach to the standards and the assessments, will we guarantee that all important aspects of the curriculum are covered? The Linn and Shepard study: improvements on a narrow assessment do not transfer to other assessments. Shepard et al.: in high-stakes districts, high performance on consequential assessments comes at a price...
Linn and Shepard's work: [Figure: year-by-year (Years 1-5) score trends on a new standardized test versus the old standardized test.]

Shepard et al.'s work: [Figure: the consequences of high stakes on alternative assessments. Scores on the consequential standardized assessment (ST) are compared with scores on a more authentic assessment (AA) of the same skill domain, in high-stakes schools versus low-stakes schools.]

Key concept (Haladyna), test score pollution: a rise or fall in a score on a test without an accompanying rise or fall in the cognitive or affective outcome allegedly measured by the test.

Aligning everything to the standards, a model worth rejecting: Standards drive Assessment, which drives Instruction. This model is likely to shape the instruction too narrowly and to lead to test score pollution.

A better way of thinking about the link between standards, instruction, and assessment: standards, as how we operationalize our values about teaching and learning, guide the development of both teaching and learning activities and assessment activities. This is the logic of lots of good reform projects! The relationship can operate at the regional or local level.

Pearson's Fifth Law of Assessment: Alignment is a double-edged sword. If there must be alignment, lead with the instruction and let the assessment follow.

Pearson's Sixth Law: High stakes will corrupt any assessment, no matter how virtuous or pure in intent.

Corollary to Pearson's Fifth and Sixth Laws: The worst possible combination is high stakes and low challenge.

So how did we do in responding to the challenges from Valencia & Pearson?
Issue | Grade | Solution
Prior knowledge | D | Choice of passages
Authentic text | B+ | Things are lots better on lots of comprehension assessments
Inference | B | Depends on the test
Diversity in knowledge means diversity in response | D | Constructed response and multiple correct answers or graded answers
Flexible use of strategies | C | Hard to assess; easy to coach; I'd abandon except for diagnostic interviews
Synthesizing information is paramount | D | Still too much emphasis on details
Asking questions as an index of comprehension | D | No progress except in informal classroom assessment
Measuring habits, attitudes, and dispositions | C | Some reasonable things out there, but no teeth
Orchestrating many skills | D | Too many mastery skills; not enough growth skills
Fluency | D | Made a fetish out of it
Transfer and application | D | Limited to a few situations
Overall grade | D | Lots of work to do

Where should we be headed? So, what makes sense for a district or school? Develop an educational improvement system.

Elements of an educational improvement system: Standards, yes. Assessments, yes: outcome assessments for program evaluation; benchmark assessments for monitoring individual progress; "closer look" diagnostic assessments for determining individual student emphases. A reporting system, yes, as long as we are prepared to live with the dilemmas of disaggregation. Alignment, but of a different sort.

Outcome assessments: they drop in out of the sky; they are curriculum independent; they assess reading in its most global aspects; they target growth constructs, NOT mastery constructs. Could be some sort of standardized assessment.

A plan for early reading benchmark assessments: every so often, give four benchmark assessments.
Benchmarks for intermediate and secondary grades: [Table: text types (narratives; information genres) crossed with three operations. Comprehend: response to literature for narratives; summaries, charts, and key ideas for information genres. Deconstruct: author's craft and genre (form follows function): what do authors do and why. Compose: creative writing for narratives; writing from sources to convey ideas for information genres.]

Closer-look assessments: there is no sin in examining the infrastructure of reading. We really do need to know which of those pieces kids have and have not mastered. The question is what to do about them: teach to and practice the weak bits; rely on strengths to bootstrap the weaknesses; or just read more "just right" material. The flaw in teaching to weaknesses is the basic-skills conspiracy of good intentions: first you gotta get the words right and the facts straight before you can do the what-ifs and I-wonder-whats.

Monitoring conditions of instruction: collect data on curriculum and instructional practices. We need clear data on the enacted curriculum and instructional practices in order to link them as precisely as possible to achievement. Use the data for program improvement and to design professional development.

Return to the hard work on assessment. Encouraged by recent funding of new-century assessments: some good could come out of our Reading for Understanding assessment grants in the US, and there are possibilities in the Australian work (NAPLAN??). Tests that take the high road (tests worth teaching to): focus on making and monitoring meaning; on the role of reading in knowledge building and the acquisition of disciplinary knowledge; on critical reasoning and problem solving; on representation of self.
The unfinished business from the 1990s. Where could we be headed: a near-term research agenda. The development of more trustworthy, more useful curriculum-based assessments: expanding the logic of the Informal Reading Inventory; getting comprehension assessment right; computerized assessments (yes, but no time today).

Expanding the logic of the IRI: a benchmark books model a la Reading Recovery, with indices of the level of text one can read independently, accuracy (including error patterns), fluency, and comprehension. Not one, not two, not three, but many, many conceptually and psychometrically comparable passages at every level of text challenge.

Comprehension assessment: our models for external assessment, modeled after some of the better wide-scale assessments, are OK. We desperately need a school/classroom tool that does for comprehension what running records/benchmark books have done for oral reading accuracy and fluency.

Disciplinary grounding: we're much better off if we ground our comprehension assessments in the inquiry and knowledge traditions of the disciplines. Pearson's (bet on a) Seventh Law of Assessment: Comprehension assessment begins and ends within the knowledge traditions and inquiry processes of each discipline. A (bet on a) corollary to the Seventh Law: Summative (big external) assessments of reading comprehension will be better if they begin as formative (smaller internal) assessments of reading comprehension within the knowledge traditions and inquiry processes of each discipline.

My bottom line: tests that are instructionally sensitive, psychometrically sound, and trustworthy. No decision of consequence should be based upon a single indicator. Tests are a means to an end.

To reduce it to a single idea: six, maybe seven laws; two, maybe three corollaries. But only one thing truly worth remembering: Never send a test out to do a curriculum's job!
Coda in Stuart McNaughton's spirit: a new bumper sticker with a tinge of optimism. Tests in support of teaching and learning.

Computerized assessment: with advances in voice recognition, we are close to being able to teach computers to recognize and score students' oral responses. Applications: listen to oral reading of benchmark passages and conduct a first-level diagnosis (thus eliminating a key barrier, time, to more widespread use of this important diagnostic tool).

Computerized assessment in early literacy, more applications of voice recognition: phonemic awareness tasks; word reading tasks; phonics tests (both real words and synthetic words). Comprehension assessment is still a way down the road because of the interpretive problem: the computer has to both listen to and understand the response. BARLA: Bay Area Reading and Listening Assessment.