Interpreting Item Analyses
Ernest N. Skakun
University of Alberta
Train the Trainer Workshop
October 29, 2003
Division of Studies in Medical Education
Faculty of Medicine & Dentistry
University of Alberta, Edmonton, Alberta, Canada
Hong Kong International Consortium
University of Alberta
Edmonton, Alberta,
Canada
What is item analysis?
• the process by which exam items are
examined critically
What is the purpose of item analysis?
• determine whether items function according
to expectation
• identify structural flaws
• improve item quality
How is item analysis accomplished?
• judgmental
• empirical - statistical
Objectives
At the conclusion of this session, you will be
able to:
1. Identify 2 item analysis methods.
2. Define and interpret various item statistics.
3. Use item statistics to identify possible item
deficiencies.
Judgmental
• Are the content and process contained in the item relevant?
• Is the item properly structured?
• Is the item free of bias?
• Is the item properly classified?
Empirical – Statistical
• Item difficulty
• Item – total test relationship
• Item discrimination
• Distractor analysis
Example of item and item statistics
Which one of these is the deepest injection?
A. Z-track.
B. Intramuscular.
C. Subcutaneous.
D. Intradermal.
E. Intravenous.
Statistical report for Item 1
ITEM 1: DIF=0.837, RPB=0.179, CRPB=0.049 (95% CON=-0.125, 0.220),
RBIS=0.269, CRBIS=0.073, IRI=0.066

GROUP   N    INV  NF  OMIT    A     B*    C     D     E
TOTAL   129   0    0    0    .12   .84   .00   .01   .04
HIGH     39   0              .05   .95   .00   .00   .00
MID      58   0              .12   .81   .00   .02   .05
LOW      32   0              .19   .75   .00   .00   .06

TEST SCORE MEAN %            78.3  80.4  .00   80.0  76.3
DISCRIMINATING POWER         -.14  .20   .00   .00   -.06
STANDARD ERROR OF D.P.       .08   .08   .00   .00   .04
Definition and interpretation of statistical indices
ITEM 1: DIF=0.837, RPB=0.179, CRPB=0.049 (95% CON=-0.125, 0.220),
RBIS=0.269, CRBIS=0.073, IRI=0.066
DIF – Difficulty (0.837, ~84%)
• proportion of examinees answering the item correctly
• range 0.00 – 1.00 (0 – 100%)
• the closer DIF is to 1.00 (100%), the easier the item
RPB – Point-biserial correlation (0.179)
• correlation between the item and the total test score
• range -1.00 – 1.00
Definition and interpretation of statistical indices (cont’d)
CRPB – Corrected point-biserial correlation (0.049)
• correlation between item and total score not
including the item in the total score
Estimating the point-biserial correlation

Examinee   Item 1   Item 2   Item 3   Total Test
    1         1        0        1         96
    2         1        0        0         85
    3         1        0        1         82
    4         0        1        0         63
    5         0        1        1         56
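The point-biserial correlation is just the Pearson correlation between a dichotomous (0/1) item score and the total test score. Below is a minimal sketch in Python (not the reporting software behind these slides) that estimates DIF and RPB for the three items in the table above:

import numpy as np

# 0/1 scores on each item and the total test scores of the five examinees above
item1 = np.array([1, 1, 1, 0, 0])
item2 = np.array([0, 0, 0, 1, 1])
item3 = np.array([1, 0, 1, 0, 1])
total = np.array([96, 85, 82, 63, 56])

for name, item in [("Item 1", item1), ("Item 2", item2), ("Item 3", item3)]:
    dif = item.mean()                      # proportion answering correctly
    rpb = np.corrcoef(item, total)[0, 1]   # point-biserial = Pearson r for a 0/1 variable
    print(f"{name}: DIF = {dif:.2f}, RPB = {rpb:.2f}")

Item 1 correlates strongly and positively with the total score, Item 2 negatively, and Item 3 only weakly, which is exactly the pattern the point-biserial is meant to reveal.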
More Statistical indices ....
95%CON – 95% Confidence Interval for the CRPB
(95% CON = - 0.125, 0.220)
RBIS – Biserial correlation (0.269)
• Correlation between item and total test score
CRBIS – Corrected biserial correlation (0.073)
• Correlation between item and total test score not including
the item in the total score
IRI – Item Reliability Index (0.066)
• Product of point biserial correlation (RPB) and the square
root of the product of DIF and 1-DIF
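As a check on these definitions, the reported IRI for Item 1 follows directly from its DIF and RPB: IRI = RPB × √(DIF × (1 − DIF)) = 0.179 × √(0.837 × 0.163) ≈ 0.066.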
Other Aspects of the Item Report
GROUP N INV NF OMIT A B* C D E
N – Number of examinees in the group (129)
INV – Number of examinees NOT providing a valid response to this item (0)
NF – Number of examinees NOT finishing the test from this item onwards (0)
OMIT – Number of examinees omitting this item (0)
A – E – Alternatives; the best answer is starred (B*)
Other Aspects of the Item Report (Cont’d)
GROUP   N    INV  NF  OMIT    A     B*    C     D     E
TOTAL   129   0    0    0    .12   .84   .00   .01   .04
HIGH     39   0              .05   .95   .00   .00   .00
MID      58   0              .12   .81   .00   .02   .05
LOW      32   0              .19   .75   .00   .00   .06

TEST SCORE MEAN %            78.3  80.4  .00   80.0  76.3
DISCRIMINATING POWER         -.14  .20   .00   .00   -.06
STANDARD ERROR OF D.P.       .08   .08   .00   .00   .04
Other Aspects of the Item Report (Cont’d)
HIGH – Approximately 27% of the examinee group scoring the highest on the test
MID – Approximately 46% of the examinee group scoring in the middle on the test
LOW – Approximately 27% of the examinee group scoring the lowest on the test
TEST SCORE MEAN % – Mean % score on the total test of the examinees selecting their indicated response to this item
Other Aspects of the Item Report (Cont’d)
DISCRIMINATING POWER – The difference between the proportions of the HIGH and LOW groups selecting the indicated response to this item
STANDARD ERROR OF D.P. – The amount of error associated with the DISCRIMINATING POWER
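The report values for the best answer B* of Item 1 can be reproduced with a short sketch. The standard error here is assumed to be the usual standard error of a difference between two independent proportions; that assumption reproduces the .08 shown in the report:

import math

def discriminating_power(p_high, p_low, n_high, n_low):
    # D.P. = proportion of the HIGH group minus proportion of the LOW group
    dp = p_high - p_low
    # Assumed S.E.: standard error of a difference between two independent proportions
    se = math.sqrt(p_high * (1 - p_high) / n_high + p_low * (1 - p_low) / n_low)
    return dp, se

# B* on Item 1: chosen by .95 of the 39 HIGH examinees and .75 of the 32 LOW examinees
dp, se = discriminating_power(0.95, 0.75, 39, 32)
print(f"D.P. = {dp:.2f}, S.E. = {se:.2f}")   # D.P. = 0.20, S.E. = 0.08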
Tying it all together
Which one of these is the deepest injection?
A. Z-track.
B. Intramuscular.
C. Subcutaneous.
D. Intradermal.
E. Intravenous.
ITEM 1: DIF=0.837, RPB=0.179, CRPB=0.049 (95% CON=-0.125, 0.220),
RBIS=0.269, CRBIS=0.073, IRI=0.066

GROUP   N    INV  NF  OMIT    A     B*    C     D     E
TOTAL   129   0    0    0    .12   .84   .00   .01   .04
HIGH     39   0              .05   .95   .00   .00   .00
MID      58   0              .12   .81   .00   .02   .05
LOW      32   0              .19   .75   .00   .00   .06

TEST SCORE MEAN %            78.0  80.0  .00   80.0  76.0
DISCRIMINATING POWER         -.14  .20   .00   .00   -.06
STANDARD ERROR OF D.P.       .08   .08   .00   .00   .04
Graphic display of examinee performance by quintiles
[Chart: proportion correct (DIFF, scale 0 to 1.0) on Item 1 plotted for each quintile of examinees, from Highest to Lowest.]
What would we like to see for item statistics?
ITEM 1: DIF=0.750, RPB=0.421, CRPB=0.376 (95% CON=0.291, 0.520),
RBIS=0.539, CRBIS=0.473, IRI=0.182

GROUP   N    INV  NF  OMIT    A     B*    C     D     E
TOTAL   129   0    0    0    .10   .75   .08   .07   .10
HIGH     39   0              .00  1.00   .00   .00   .00
MID      58   0              .08   .74   .06   .05   .07
LOW      32   0              .15   .48   .11   .10   .16

TEST SCORE MEAN %            68.3  74.4  70.0  65.0  71.3
DISCRIMINATING POWER         -.15  .52   -.11  -.10  -.16
STANDARD ERROR OF D.P.       .09   .16   .08   .08   .09
Using item statistics to identify possible problems
ITEM 1: DIF=0.223, RPB=0.061

GROUP   N    INV  NF  OMIT    A*    B     C     D     E
TOTAL   129   0    0    0    .22   .18   .20   .19   .21
Using item statistics to identify possible problems (Cont’d)
ITEM 2: DIF=0.109, RPB=0.021

GROUP   N    INV  NF  OMIT    A     B     C     D*    E
TOTAL   129   0    0    0    .01   .71   .04   .11   .13
Using item statistics to identify possible problems (Cont’d)
ITEM 3: DIF=0.503, RPB=0.291

GROUP   N    INV  NF  OMIT    A     B     C*    D     E
TOTAL   229   0    0    0    .44   .00   .50   .00   .06
HIGH          0              .26   .00   .71   .00   .03
MID           0              .51   .00   .43   .00   .06
LOW           0              .53   .00   .38   .00   .09
Using item statistics to identify possible problems (Cont’d)
ITEM 4: DIF=0.821, RPB=-0.181

GROUP   N    INV  NF  OMIT    A     B     C     D     E*
TOTAL   113   0    0    0    .00   .00   .18   .00   .82
HIGH          0              .00   .00   .25   .00   .75
MID           0              .00   .00   .24   .00   .76
LOW           0              .00   .00   .06   .00   .94
Using item statistics to identify possible problems (Cont’d)
ITEM 5: DIF=0.721, RPB=-0.181

GROUP   N    INV  NF  OMIT    A*    B     C     D     E
TOTAL   160   0    0    0    .72   .03   .15   .08   .02
HIGH          0              .62   .06   .18   .12   .03
MID           0              .71   .03   .14   .09   .03
LOW           0              .82   .00   .12   .06   .00

DISCRIMINATING POWER         -.20  .06   .06   .06   .03
Final Words on Item Analyses
Begin with the judgmental method and ensure that the responses to the following questions are positive:
• Are the content and process contained in the item relevant?
• Is the item properly structured?
• Is the item free of bias?
• Is the item properly classified?
If the response to any of the above questions is negative, then take corrective measures.
Then consider the item statistics; a small screening sketch follows this checklist.
• Is the item of appropriate difficulty?
• Is the item – total test score correlation positive?
• Is the discriminating power for the best answer positive?
• Is the discriminating power for each distractor negative?
• Are the distractors functional?
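The following sketch turns the statistical checklist above into a simple screen for flagging possibly deficient items. The cut-offs (the difficulty band and the minimum distractor use) are illustrative assumptions, not values prescribed in this workshop:

def flag_item(dif, crpb, dp_key, dp_distractors, distractor_use):
    """Return a list of possible deficiencies for one item."""
    flags = []
    if not 0.30 <= dif <= 0.90:                    # appropriate difficulty? (assumed band)
        flags.append("difficulty outside the assumed 0.30-0.90 band")
    if crpb <= 0:                                  # item - total test score correlation positive?
        flags.append("item-total correlation is not positive")
    if dp_key <= 0:                                # discriminating power of the best answer positive?
        flags.append("discriminating power of the best answer is not positive")
    if any(dp > 0 for dp in dp_distractors):       # discriminating power of each distractor negative?
        flags.append("a distractor has positive discriminating power")
    if any(p < 0.02 for p in distractor_use):      # distractors functional? (assumed 2% minimum use)
        flags.append("a distractor attracts almost no examinees")
    return flags

# Item 1 from the report above (key B*; distractors A, C, D, E)
print(flag_item(dif=0.837, crpb=0.049, dp_key=0.20,
                dp_distractors=[-0.14, 0.00, 0.00, -0.06],
                distractor_use=[0.12, 0.00, 0.01, 0.04]))

For Item 1 this screen would flag only the nonfunctional distractors (C and D attract almost nobody), consistent with the report above.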
References:
Case S.M. & Swanson D.B. (2001).
Constructing written test questions for the basic and clinical
sciences. Philadelphia: National Board of Medical Examiners.
Osterlind S.J. (1998).
Constructing test items. Boston: Kluwer Academic Publishers.
Determining Angoff & Nedelsky Values for Items
Determining Angoff & Nedelsky Values for Items
• How much is enough?
• Setting a passing score for an exam
• Early attempts were either normative or absolute
• Present practice relies on methods that are test-centered
or examinee-centered or a combination of the two
• Requires decisions made by subject matter experts
(judges)
Methods
Test-centered
– Angoff
– Nedelsky
– Ebel
– Jaeger
Panels of subject matter experts (judges) render decisions
regarding expected performance of minimally competent
examinees (borderline) on each item.
Objectives
At the conclusion of this session, you will be
able to:
1. Identify the task for the judge in the Angoff and
Nedelsky methods.
2. Explain how judges’ decisions on individual items
are used in setting a passing score.
Angoff
Task
Judge is to indicate the proportion of minimally
competent examinees that should answer the item
correctly.
Angoff (Cont’d)
A previously healthy 26-year-old female is suddenly
seized with pleuritic pain in the left chest and
shortness of breath.
The most likely cause is:
A. mycoplasma pneumonia.
B. spontaneous pneumothorax.
C. pulmonary embolism.
D. acute pericarditis.
E. epidemic pleurodynia.
Example of the Angoff Method
Item   Probability of Correct Answer
  1              1.00
  2               .65
  3               .80
  4               .55
  5               .75
  6               .65
  7               .50
  8               .45
  9               .25
 10               .40
                Sum = 6.00
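A minimal sketch of the arithmetic on this slide: one judge's Angoff values for the ten items are summed to give the passing score implied by those judgments.

# One judge's Angoff values for items 1-10 (from the table above)
angoff_values = [1.00, .65, .80, .55, .75, .65, .50, .45, .25, .40]
passing_score = sum(angoff_values)
print(f"Passing score = {passing_score:.2f} out of {len(angoff_values)} items "
      f"({100 * passing_score / len(angoff_values):.0f}%)")   # 6.00 out of 10 items (60%)

In a committee setting, each judge's sum would typically be pooled with the others to reach the consensus value described under Process below.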
Nedelsky
Task
Judge is to indicate the alternatives minimally
competent examinees should eliminate as
incorrect.
With a normal curve of distribution, the percent of
observations that lie within plus and minus one
standard deviation from the mean is approximately:
A. 10%.
B. 15%.
C. 34%.
D. 68%.
E. 95%.
Nedelsky (Cont’d)
Alternatives: A  B  C  D  E
Eliminated alternatives – A B
Number of remaining alternatives = 3
Reciprocal of remaining alternatives = 1/3
Nedelsky value for item = .33
Example of Calculations for Nedelsky Method
Item   Number of Alternatives Not Eliminated   Reciprocal   Expected Score
  1                    3                          1/3            .33
  2                    1                          1/1           1.00
  3                    2                          1/2            .50
  4                    3                          1/3            .33
  5                    1                          1/1           1.00
  6                    2                          1/2            .50
  7                    4                          1/4            .25
  8                    4                          1/4            .25
  9                    5                          1/5            .20
 10                    1                          1/1           1.00
                                                          Sum = 5.36
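A minimal sketch of the calculation on this slide: each item's Nedelsky value is the reciprocal of the number of alternatives the minimally competent examinee would not eliminate, and the passing score is the sum of those values (each rounded to two decimals, as in the table).

# Alternatives NOT eliminated for items 1-10 (from the table above)
remaining = [3, 1, 2, 3, 1, 2, 4, 4, 5, 1]
values = [round(1 / r, 2) for r in remaining]
print(values)                                   # [0.33, 1.0, 0.5, 0.33, 1.0, 0.5, 0.25, 0.25, 0.2, 1.0]
print(f"Passing score = {sum(values):.2f}")     # 5.36, matching the slide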
• Regardless of method, judgments are arbitrary, but
should NOT be capricious.
• The level of performance set for each item is a
minimal standard.
• Likewise, the passing score set for the test is a
minimal standard.
Process
• Committee consensus on Angoff values for items
• If no committee, use best judgment in setting Angoff
value for an item you have generated
• Ask a colleague to review your judgments
• Use item analysis data as a reality check on judgments (see the report excerpt and sketch below)
GROUP   N    INV  NF  OMIT    A     B*    C     D     E
TOTAL   129   0    0    0    .12   .84   .00   .01   .04
HIGH     39   0              .05   .95   .00   .00   .00
MID      58   0              .12   .81   .00   .02   .05
LOW      32   0              .19   .75   .00   .00   .06
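As an illustration of the reality-check bullet above, the sketch below compares a judged Angoff value with the observed difficulty (DIF) from an item report such as the excerpt above. The 0.65 judgment and the 0.15 tolerance are hypothetical values chosen for illustration:

def reality_check(angoff_value, observed_dif, tolerance=0.15):
    # Flag a judgment when observed performance departs markedly from the expected value
    gap = observed_dif - angoff_value
    if abs(gap) > tolerance:
        return (f"Revisit the judgment: observed DIF {observed_dif:.2f} differs from the "
                f"Angoff value {angoff_value:.2f} by {gap:+.2f}")
    return "Judgment is consistent with examinee performance"

# Item 1 above: .84 of examinees chose the best answer B*; assume a judge had set 0.65
print(reality_check(angoff_value=0.65, observed_dif=0.837))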
A word of caution on using item statistics ….
• Adopt an iterative process for reviewing the
level of performance set for the item.
References:
• Case S.M. & Swanson D.B. (2001). Constructing written test questions for the basic and clinical sciences. Philadelphia: National Board of Medical Examiners.
• Cizek G.J. (Ed.) (2001). Setting performance standards: Concepts, methods, and perspectives. Hillsdale: Lawrence Erlbaum Associates.
• Livingston S.A. & Zieky M.J. (1982). Passing scores: A manual for setting standards of performance on educational and occupational tests. Princeton: Educational Testing Service.