Transcript: Standard Setting Research

Advanced Topics in Standard Setting
• Methodology
• Implementation
• Validity of standard setting
What is Standard Setting?
• Students taking standards-based tests must
demonstrate through their test scores that
they have achieved at least a minimum level
of competency (MLC). This MLC is defined
operationally as the cut score.
• The process of recommending the cut score
is called standard setting.
Standard Setting Methods
• Test Centered
• Examinee Centered
• Item Mapping
Standard Setting Methods
• Test Centered:
– Items are presented to the panelists in the same
order as they appear on the test
– Panelists predict a target student’s (e.g., a barely
proficient student’s) performance on each item in the test
Angoff (1971)
• Very popular in licensure and certification tests
• What is the probability that a barely proficient
student will answer the item correctly?
Item    P1      P2      P3      Avg.
1       0.60    0.75    0.50    0.62
2       0.75    0.70    0.75    0.73
3       0.35    0.25    0.50    0.37
4       0.85    0.90    1.00    0.92
Sum     2.55    2.60    2.75    2.63 (cut score)
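As a minimal sketch of the arithmetic behind the illustrative table above (not the presenter’s code): each panelist’s item probabilities are summed, and the panel’s recommended cut is the average of those panelist sums.

    # Minimal sketch of the Angoff cut-score arithmetic; the ratings are the
    # illustrative values from the table above, not real panel data.
    ratings = {
        "P1": [0.60, 0.75, 0.35, 0.85],
        "P2": [0.75, 0.70, 0.25, 0.90],
        "P3": [0.50, 0.75, 0.50, 1.00],
    }

    # Each panelist's cut is the sum of his or her item probabilities.
    panelist_cuts = {p: sum(r) for p, r in ratings.items()}

    # The recommended cut is the average of the panelist cuts.
    cut_score = sum(panelist_cuts.values()) / len(panelist_cuts)
    for p, c in panelist_cuts.items():
        print(f"{p}: {c:.2f}")             # P1: 2.55, P2: 2.60, P3: 2.75
    print(f"cut score = {cut_score:.2f}")  # cut score = 2.63

The same sum-then-average logic applies to the Yes/No and Extended Angoff tables below.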
Angoff (1971)
• Cognitively very challenging
– Conceptualize a barely proficient student
– Predict the probability that such a student gives
a correct response to the item
• Panelists tend to underestimate the difficulty of
easy items and overestimate the difficulty of
hard items
Yes/No (1997)
• For each item, the panelist asks, “Will 2/3 of
the barely proficient students be able to
answer the test item correctly (yes or no)?”
• Cognitively less challenging
• Used only for MC item tests
Item    P1      P2      P3      Avg.
1       1       1       0       0.67
2       1       1       1       1.00
3       0       0       1       0.33
4       1       1       1       1.00
Sum     3       3       3       3.00 (cut score)
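With the illustrative ratings above, each panelist answered “yes” for three of the four items, so each panelist’s cut is 3 and the recommended cut is (3 + 3 + 3) / 3 = 3 items correct out of 4.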
Extended Angoff
• Each item has a maximum possible score of 4.
• Scoring rubric: students can get only 0, 1, 2, 3,
or 4 points.
• What is the probable score that a barely
proficient student will get on the item?
Item    P1      P2      P3      Avg.
1       2.0     3.0     3.0     2.67
2       1.0     1.0     1.0     1.00
3       3.0     3.0     2.0     2.67
4       3.0     2.0     2.0     2.33
Sum     9.0     9.0     8.0     8.67 (cut score)
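With the illustrative ratings above, the panelist sums are 9.0, 9.0, and 8.0 points, so the recommended cut is (9.0 + 9.0 + 8.0) / 3 ≈ 8.7 out of a possible 16 points.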
Standard Setting Methods
• Examinee Centered:
– Panelists identify a target student’s paper (i.e.,
student folder) that is consistent with
performance level descriptions
• Popular Methods
– Analytical Judgmental Method (2000)
• Used only for CR and writing tests
• Not used for a test that has both MC and CR items
– Body of Work (1994)
Standard Setting Methods
• Item Mapping:
– Items are presented to panelists in an ordered item
booklet (OIB) based on the difficulty level of the
items
– Panelists review each item and determine what
knowledge, skills, and abilities are required to
answer a given item correctly
– They also consider what makes each item progressively
more difficult than the previous item in the OIB.
Standard Setting Methods
• Bookmark (1995)
– Can be used for tests with both MC and CR items
– Cognitively less challenging
• Mapmark (2005)
– Used in NAEP (2005)
– Too expensive: may not be suitable for state
assessment programs
Standard Setting Methods
• Mixed Method (2006)
– Blend of the Angoff and Bookmark standard setting
methods
– Uses the strengths of these two methods
– Not yet operationally implemented
– Experts in standard setting (personal
communication) seem to have liked this
method.
Summary: Methods
• Select the method that is appropriate for the
assessment program
– Test item type
– Consistency with the method used previously in
the program
– Prior experience of the implementer with the
method
– Resources available
Implementation
• General Standard Setting Process
– Selection of panelists
– Orientation
– Review test materials
– Review and discuss Performance Level Descriptors
(PLDs)
– Round 1 ratings
– Feedback
– Round 2 ratings
– Evaluation
Who are the Panelists
(1) Psychometricians (2) DOE staff (3) Item
writers (4) Subject matter experts
• Content knowledge
• Understands the student population
• Knowledge of the instructional environment
• Appreciates the consequences of the standards
• Relevant stakeholders (parents, university-level
educators, etc.)
How Many Panelists
• How many panelists should we have for a
standard setting study?
(a) 0 (b) 5-6 (c) 7-9 (d) 10-15 (e) 16-40
• 10-15 panelists may be sufficient to set a
defensible cut score
• The magnitude of error is influenced by more
than the number of participants, which suggests
that the precision of the cut score also depends
on other factors.
What Factors Influence their
Prediction
• Qualifications of the panelists
• Orientation
• Impact data
• NCLB
• Conceptualization of a barely proficient
student
– e.g., thinking of a specific student in one’s
own classroom
Number of Rounds
• How many times should we collect
panelists’ ratings?
(a) 0 (b) 2 (c) 3 (d) 4 (e) 5
• Two rounds of panelists’ ratings may be
adequate to get a reasonable and defensible
cut score
• Third-round ratings may not differ much
from round-two ratings
Feedback
• Should we provide feedback to the panelists
between the rounds?
If yes, what kind of data?
We often provide
• Impact data (e.g., if the cut score is 23 out of a
possible 40 points, what % of students in the state
will be classified as Proficient? See the sketch
after this list.)
• A summary of their Round 1 ratings (panelists’
locations)
• Student profiles
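As a minimal sketch of how impact data might be produced (the scores and the cut below are hypothetical, not from any state data set):

    # Minimal sketch: percentage of students who would be classified as
    # Proficient at a given raw cut. The scores are simulated and hypothetical.
    import numpy as np

    rng = np.random.default_rng(0)
    scores = rng.integers(0, 41, size=10_000)  # hypothetical raw scores on a 0-40 test
    cut = 23

    pct_proficient = (scores >= cut).mean() * 100
    print(f"{pct_proficient:.1f}% of students at or above the cut of {cut}")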
Feedback
• Process feedback (panelists’ location) typically has
the effect of reducing the standard deviation of cut
scores set by panelists
• It is also evident that the effect of feedback
diminishes as the number of rounds of feedback
and rating increases.
Evaluation
• It will be discussed in the Validity section
• It is a very important component in a
standard setting study
Streamlining of Standard Setting
Methods
• Angoff-based Methods
– Web-based standard setting
– Reduce the number of items rated by panelists
• Divide the test and the panelists into equivalent
groups
• Use only a subset of items
» content and difficulty
» content, discrimination, and difficulty
» 50% of the items may be adequate
Streamlining of Standard Setting
Methods
• Bookmark Method
– Use only a subset of items
• Content, item-type, and difficulty
• 70% of the items may be adequate
Ordered Item Booklet in Bookmark
[Figure: An ordered item booklet, with items arranged from
easiest to hardest along the θ/b scale (about -3 to 3). Cuts
are placed between adjacent item locations; for example,
Cut 1 = avg(-1.85 and -1.5) = -1.68, with Cut 2 and Cut 3
placed farther up the scale.]
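A minimal sketch of the placement arithmetic implied by the figure: the cut is taken as the average of the location of the item just before the bookmark and the location of the bookmarked item. Only the first two item locations below come from the figure; the rest are hypothetical.

    # Minimal sketch of placing a Bookmark cut from an ordered item booklet.
    # Only the first two locations come from the figure; the rest are hypothetical.
    oib_locations = [-1.85, -1.5, -0.6, 0.2, 0.9, 1.7]  # easiest to hardest (theta/b scale)

    def bookmark_cut(locations, bookmark):
        """Cut = average of the item before the bookmark and the bookmarked item."""
        return (locations[bookmark - 1] + locations[bookmark]) / 2

    print(f"Cut 1 = {bookmark_cut(oib_locations, 1):.3f}")  # -1.675, shown as -1.68 in the figure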
Conversion of Raw Cuts to
Theta Cuts
[Figure: Test characteristic curve plotting expected total
score (0 to 5) against θ/b (-3 to 3). The raw cuts Cut 1 = 2,
Cut 2 = 3, and Cut 3 = 4 are projected through the curve onto
the θ scale to obtain the corresponding theta cuts.]
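A minimal sketch of this conversion, assuming a 2PL IRT model with hypothetical item parameters (an illustration of the idea, not the operational procedure): the theta cut is the ability at which the test characteristic curve equals the raw cut.

    # Minimal sketch: convert raw cuts to theta cuts by inverting the test
    # characteristic curve (TCC) of a 5-item test under a 2PL model.
    # The item parameters are hypothetical.
    import numpy as np
    from scipy.optimize import brentq

    a = np.array([1.0, 0.8, 1.2, 0.9, 1.1])    # hypothetical discriminations
    b = np.array([-1.5, -0.5, 0.0, 0.7, 1.4])  # hypothetical difficulties

    def expected_total_score(theta):
        """TCC: sum over items of the 2PL probability of a correct response."""
        return float(np.sum(1.0 / (1.0 + np.exp(-a * (theta - b)))))

    def theta_cut(raw_cut):
        """Solve expected_total_score(theta) = raw_cut for theta."""
        return brentq(lambda t: expected_total_score(t) - raw_cut, -6.0, 6.0)

    for raw in (2, 3, 4):  # the raw cuts shown in the figure
        print(f"raw cut {raw} -> theta cut {theta_cut(raw):.2f}")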
Scaled Score
• Scaled scores are typically a linear
transformation of ability estimates
• Example of a linear transformation:
– Scaled score = (Ability × Slope) + Intercept
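For example, with a hypothetical slope of 10 and intercept of 50, a theta cut of -1.68 would convert to a scaled-score cut of (-1.68 × 10) + 50 = 33.2.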
Summary: Implementation
• Panelists should be subject matter experts
• 10-15 panelists are adequate for a standard
setting study
• Two rounds of panelists’ ratings may be
sufficient to estimate a defensible cut
• Reasonable feedback data should be
provided
Summary: Implementation
• When designing a standard setting study,
potential influencing factors should be
considered
• Streamlined standard setting procedures
may be a good option for low-stakes
tests.
Validity of Standard Setting Process
Validity of Standard Setting
• Assumptions:
– Policy assumption: It claims that the
performance standards are appropriate, given
the purpose of the decisions
– Operational assumption: It claims that students
with scores at or above the cutscore are likely
to meet the performance standards, and students
with scores below the cutscore are not likely to
meet the standards.
Validity of Standard Setting
• Policy assumption is often evaluated by
documenting procedural evidence, e.g.:
– Purposes of the decision process
– Selection of panelists
– Training of panelists
– Definition of performance standard
– Data collection procedure
Validity of Standard Setting
• Operational assumption is examined
through internal consistency evidence and
external criteria
– Internal consistency: Results that are not
internally consistent do not justify any
conclusions.
Internal Consistency Evidence
• Precision of estimates of the cutscore
– Standard error of the cutscore: if the standard
setting study were repeated, to what extent would
we be likely to get the same cutscore? (A sketch
follows this list.)
• Analysis of item-level data
– Examining performances of students with
scores near the cut score
– Comparing performance of two groups of
students (one with scores a bit above the cut
and the other a bit below the cut)
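A minimal sketch of one common estimate, assuming panelist-to-panelist variability is the error source of interest: the standard error of the cut is the standard deviation of the individual panelists’ cuts divided by the square root of the number of panelists. The cuts below are the three panelist sums from the illustrative Angoff table earlier in this transcript.

    # Minimal sketch: standard error of the cut score from panelist variability.
    # The cuts are the panelist sums from the illustrative Angoff table.
    import math
    import statistics

    panelist_cuts = [2.55, 2.60, 2.75]
    cut = statistics.mean(panelist_cuts)
    se = statistics.stdev(panelist_cuts) / math.sqrt(len(panelist_cuts))
    print(f"cut = {cut:.2f}, SE = {se:.3f}")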
Internal Consistency Evidence
• Examining performances of students with
scores near the cut score. For example,
panelists set the proficient cut at 25 out of a
possible score of 40.
– For an item, panelists think almost 90% of
borderline proficient students should be able to
answer it correctly. If the conditional p-value of
the item for students who scored 25 is much
different from 0.90, it implies that the cut score
may not be accurately placed.
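A minimal sketch of that check, using simulated (hypothetical) item responses rather than real student data:

    # Minimal sketch: conditional p-value of one item for students whose total
    # score equals the cut (25 out of 40), compared with the panelists'
    # expectation of 0.90. All data here are simulated and hypothetical.
    import numpy as np

    rng = np.random.default_rng(42)
    total_scores = rng.integers(0, 41, size=5_000)  # hypothetical total scores, 0-40
    item_correct = rng.integers(0, 2, size=5_000)   # hypothetical 0/1 responses to the item

    cut = 25
    at_cut = total_scores == cut
    conditional_p = item_correct[at_cut].mean()     # p-value for students scoring exactly 25
    print(f"conditional p-value at {cut}: {conditional_p:.2f} (panelists expected about 0.90)")
    # A large gap between this value and 0.90 would suggest the cut is misplaced.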
Internal Consistency Evidence
• Comparing performance of two groups of
students (one with scores a bit above the cut
and the other a bit below the cut):
– Compare the conditional p-values of the items. We
would expect the p-values for the above-the-cut
group to be higher than those for the
below-the-cut group.
Internal Consistency Evidence
• Intra-panelist consistency: A measure of
how consistently a panelist provides
judgments across the items
• Inter-panelist consistency: A measure of
how consistently the panelists provide
judgments on each item
– Indices that measure inter-panelist consistency
are available for the Angoff, Body of Work, and
Bookmark standard setting methods
External Criteria
• Comparisons to the results of other standard setting methods
– Challenge: different methods are not equally appropriate for a
given type of test
• Judgments by stakeholder groups (e.g., classroom
judgment data)
– Challenge: finding stakeholders who are qualified to make this
judgment (e.g., stakeholders may have an incomplete understanding
of the performance level definitions)
• Comparisons involving other assessment methods
– Existing classification data could be used as the basis for checking
the appropriateness of the cut scores
– Challenge: finding an appropriate external criterion
Summary: Validity
• Standard setting (or setting performance
standards) is a judgmental method
• Different methods may set different
performance standards for the same test
• The choice of performance standards is ultimately
a policy decision
• The standard setting procedure needs to be well
documented in order to make it defensible.