Diagnostic Score Reporting


Improving the Ways We Report Test Scores
Ronald Hambleton, April Zenisky
University of Massachusetts Amherst, USA
CERA Annual Meeting, June 1, 2010
Important Time in the Testing Field
 New provincial tests in Canada and state tests in the USA are being introduced as part of educational reform (e.g., Massachusetts went from 7 tests to more than 24 in 10 years).
 Users need to understand and use the scores and score reports correctly (or substantial funding is wasted).
1. Considerable investment of time and money has been made to address technical problems:
• IRT modeling of data, test scoring of performance data, test score equating, reliability estimation, computer technology, DIF analyses, standard setting, and validity studies.
2. Surprisingly, test score reporting attracts very little attention!
 Can you name a research study?
 Without clear and meaningful reporting of information, the other steps are of less value!
 Also, on this topic more than on other technical topics, many persons think they are experts: everyone has an idea about what to do!
AERA, APA, NCME Test Standards:
What do they say about score
scales and reporting?
5.10. When test score information is released … those responsible should provide appropriate interpretations.
--information is needed about content
coverage, meaning of scores, precision
of scores, common misinterpretations,
and proper use.
13.14 …Score reports should be
accompanied by a clear statement of
the degree of measurement error
associated with each score or
classification level and information on
how to interpret the scores.
Major Problems in Score Reporting!
 Reporting scales and data displays (the reports) are confusing to many persons:
 percents vs. percentiles;
 IQ scores;
 new scales developed by states and provinces;
 T scores, stanine scores.
Major Problems in Score Reporting!
 Quantitative literacy is not high (three kinds of persons!). Half of the US population cannot read bus schedules. What is 20 million dollars for testing? (1/3 of 1% of the education budget.)
 Norm-referenced (NRT) vs. criterion-referenced (CRT) scores.
Major Problems in Score Reporting!
 Body of evidence highlighting score reporting problems (e.g., Jaeger)
 Reporting scores without error bands
 Too much meaningless score information on some reports (what Tufte calls "chartjunk")
 Not providing meaningful diagnostic information
Goals of the Presentation
1. Consider student reports: improving the meaning of score scales and diagnostic reports.
2. Mention several emerging methodologies for researching score reports and their utility.
3. Identify a seven-step model for improving score report design and evaluation.
Individual Test Score Reports
 In the USA, over 30,000,000 individual reports go to parents of school children alone.
 Over 1,000 credentialing exams, and some of the exams exceed 100,000 candidates (e.g., securities, accountants, nurses).
Shortcomings in the Student Reports
(Goodman & Hambleton, AME, 2004)
 No stated purpose, no advance organizer, no clues about where to start reading.
 Performance categories (typically) are not defined, even briefly.
 No error bands on any of the reported scores, or even a hint that errors of measurement (i.e., imprecision) are present!
Shortcomings in the Student Reports
 Font is often too small to read easily.
 Instructional needs information is not always user-friendly, e.g. (to a parent), "You need help in extending meaning by drawing conclusions and using critical thinking to connect and synthesize information within and across text, ideas, and concepts."
Shortcomings in the Student Reports
 Several undefined terms on the displays:
percentile, prompt, z score, performance
category, achievement level, and more.
 Basically, the reports are crowded!
Two Ideas for Score Reports
 Benchmarking is one of our favorite and most promising ideas:
 It capitalizes on item response theory (IRT): strong modeling of data, with items and candidates reported on the same scale.
 Researchers have been slow to take advantage of this.
Bench-Marking Solution: Makes Scale Scores More Meaningful
 Place boundary points on the reporting scale.
 Choose a probability associated with "knowing/can do", say, 65%.
 Use the ICCs from IRT to develop descriptions of what examinees can and cannot do between boundary points (a sketch of the computation follows).
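As a rough illustration of the computation (not the presenters' code), the sketch below solves the 3PL item characteristic curve for the ability at which the probability of a correct response is 0.65, converts that ability to a reporting scale, and assigns each item to a score band. The item parameters, the 500 + 100*theta scale transformation, and the 100-point boundaries are illustrative assumptions.

```python
import math

D = 1.7    # scaling constant for the logistic IRT model
RP = 0.65  # chosen "knowing/can do" response probability

def theta_at_rp(a, b, c, rp=RP):
    """Ability at which a 3PL item is answered correctly with probability rp.

    Solves rp = c + (1 - c) / (1 + exp(-D * a * (theta - b))) for theta.
    """
    if rp <= c:
        raise ValueError("rp must exceed the guessing parameter c")
    return b + math.log((rp - c) / (1.0 - rp)) / (D * a)

def to_scale(theta):
    # Hypothetical linear transformation to a 200-800 reporting scale.
    return 500.0 + 100.0 * theta

# Illustrative 3PL item parameters (a, b, c) -- not from the presentation.
items = {
    "item_01": (1.2, -0.8, 0.20),
    "item_02": (0.9,  0.1, 0.15),
    "item_03": (1.5,  1.1, 0.25),
}

# Assumed boundary points on the reporting scale, 100 points apart.
boundaries = [200, 300, 400, 500, 600, 700, 800]

for name, (a, b, c) in items.items():
    scaled = to_scale(theta_at_rp(a, b, c))
    # The band in which examinees reach a 65% chance on this item; the item
    # content falling in each band feeds the "can do" descriptions.
    band = next((lo for lo, hi in zip(boundaries, boundaries[1:])
                 if lo <= scaled < hi), None)
    print(f"{name}: 65% point at scaled score {scaled:.0f} (band starting at {band})")
```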
[Figure: (3P) Item Characteristic Curve (ICC). Probability of a correct response (0.0 to 1.0) plotted against ability (-3 to 3), with labels A and B and a frequency distribution of ability also shown.]
[Figure: Item Characteristic Curves for 60 Items. Expected score (on the 0-1 metric) plotted against the proficiency scale (-3 to 3), with a horizontal reference line at P = 0.65 and boundary points W, N, and P marked on the scale. The five reporting categories: Topic 1 (13 items, 16 points), Topic 2 (18 items, 21 points), Topic 3 (9 items, 12 points), Topic 4 (8 items, 11 points), Topic 5 (12 items, 15 points).]
[Figure sequence: Making Score Scales More Meaningful. Item characteristic curves plotted against the Mathematics reporting scale (200 to 800), with probability on the vertical axis (0.00 to 1.00). The P = 0.65 line locates items on the scale, and successive slides highlight the 400, 500, and 600 score points, where boundary points labeled B, P, and A are placed.]
[Slide: Meaning of the Mathematics Scale (200 to 800). Six score bands (200-290, 300-390, 400-490, 500-590, 600-690, 700-790) are each paired with a description of what students at that level can do, ranging from solving very basic problems (e.g., simple arithmetic and reading simple data displays) at the lowest band, through solving routine problems in each of the content areas, to generalizing, making connections within and across content areas, and solving complex multi-step problems involving geometric/algebraic relationships at the highest band.]
Common Diagnostic Report
 Candidate results by subdomain categories (e.g., math):

Content Domain                        Score Points   Percent Correct
1. Data Analysis, Stats (20%)         1 of 10        10%
2. Geometry (10%)                     6 of 8         75%
3. Measurement (20%)                  9 of 12        75%
4. Number Sense/Operations (15%)      4 of 9         44%
5. Patterns (35%)                     4 of 22        18%
Highly Problematic Report!!
 No sense of measurement error (illustrated below)
 No guarantee that the items are representative
 No basis for score interpretation
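To see why the missing error information matters, here is a minimal sketch (our illustration, not part of the report) that attaches a rough binomial standard error and a plus/minus 2 SE band to each subdomain percent correct from the report above; a full treatment would use IRT-based conditional errors.

```python
import math

# Subdomain results from the report above: (number correct, items attempted).
subscores = {
    "Data Analysis, Stats":    (1, 10),
    "Geometry":                (6, 8),
    "Measurement":             (9, 12),
    "Number Sense/Operations": (4, 9),
    "Patterns":                (4, 22),
}

for domain, (right, n) in subscores.items():
    p = right / n
    se = math.sqrt(p * (1 - p) / n)                      # binomial SE of the proportion
    lo, hi = max(0.0, p - 2 * se), min(1.0, p + 2 * se)  # rough 95% band
    print(f"{domain:25s} {p:4.0%}  band: {lo:4.0%} to {hi:4.0%}")
```

With only 8 to 22 items per domain, these bands span dozens of percentage points, so apparent strengths and weaknesses can easily be noise.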
Mathematics: Your Performance Compared to Passing Students

Content Domain                        Your Performance   Passing Student Performance
1. Data Analysis, Stats (20%)         10%                20%
2. Geometry (10%)                     75%                60%
3. Measurement (20%)                  75%                90%
4. Number Sense/Operations (15%)      44%                60%
5. Patterns (35%)                     18%                65%

Overall Performance
Multiple Choice (70%)
Constructed Response (30%)

(Each content domain and overall row is also marked Weaker, Comparable, or Stronger relative to passing students.)
A Better Report!!
 Confidence bands
 A frame of reference: performance of borderline candidates, or passing candidates, for example (one way to turn these into flags is sketched below).
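One way such flags might be assigned (an illustrative rule, not necessarily the one behind the report above): call a subdomain Comparable whenever the passing-student value falls inside the candidate's plus/minus 2 SE band, and Weaker or Stronger otherwise. The sketch below applies that rule to the example report's numbers.

```python
import math

def compare(right, n, passing_pct):
    """Flag a subdomain as Weaker/Comparable/Stronger relative to passing students."""
    p = right / n
    se = math.sqrt(p * (1 - p) / n)
    ref = passing_pct / 100.0
    if abs(p - ref) <= 2 * se:      # reference value inside the rough 95% band
        return "Comparable"
    return "Stronger" if p > ref else "Weaker"

# (number correct, items attempted, passing-student percent) from the example report.
rows = {
    "Data Analysis, Stats":    (1, 10, 20),
    "Geometry":                (6, 8, 60),
    "Measurement":             (9, 12, 90),
    "Number Sense/Operations": (4, 9, 60),
    "Patterns":                (4, 22, 65),
}

for domain, (right, n, passing) in rows.items():
    print(f"{domain:25s} {compare(right, n, passing)}")
```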
Score Report Design & Evaluation
 Experiments
 Focus Groups
 Think-alouds
 Qualitative Reviews from the Field
 Tryouts
7 Steps in Report Development
1. Define purpose of score report
2. Identify intended audience(s)
3. Review report examples/literature
4. Develop report(s)
5. Data collection/field test
6. Revise and redesign
7. Ongoing maintenance
Necessary Research
 Reducing the size of error bands for knowledge/skill areas:
 improving the quality of test items
 improving the targeting of the test
 capitalizing on correlational information among the skills or other priors (a simple version is sketched below)
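One simple version of borrowing strength from a prior is Kelley's regressed estimate, which shrinks an unreliable observed subscore toward the group mean in proportion to its reliability; fuller approaches (e.g., augmented subscores) also use the correlations among skill scores. A minimal sketch with made-up numbers:

```python
def kelley_estimate(observed, group_mean, reliability):
    """Kelley's regressed estimate: shrink an observed score toward the group
    mean in proportion to the score's reliability."""
    return group_mean + reliability * (observed - group_mean)

# Made-up values: a short, unreliable subscore (18% correct, reliability 0.45)
# is pulled strongly toward the group mean of 60%.
print(kelley_estimate(observed=18.0, group_mean=60.0, reliability=0.45))  # 41.1
```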
Necessary Research (cont.)
 Learning to move from the ICCs, to
choosing the number of performance
categories, to preparing the
descriptive statements that can
enhance the meaning of a score
scale, and validation.
Final Remarks
 Important advances have been made
in score reporting.
 More research needed on matching
score reports to intended audiences,
and evaluating score reports prior to
use.
 Diagnostic reports are important to
users but need more research.
Final Remarks
 The seven-step model should be used, and exemplar reports compiled.
 We are pleased to see the developments taking place.
-- States, provinces, and countries are beginning to use the tools, and progress can be seen.
 See the NCME bibliography by Deng and Yoo with 70+ pages of references!