
Using Rasch Analysis to Develop an
Extended Matching Question (EMQ)
Item Bank for Undergraduate Medical
Education
Mike Horton
Bipin Bhakta
Alan Tennant
Medical student training

Miller (1990) identified that no single assessment
method can provide all the data required for judging
anything as complex as the delivery of professional
services by a competent physician
• Knowledge
• Skills
• Attitudes
Miller’s pyramid of competence (apex to base):
• Does – case presentations, log books, direct observation of clinical activity
• Shows how – OSCE
• Knows how – EMQ, FRQ, essays
• Knows – MCQ
What are assessments used for?

Primary aim
• To identify the student who is deemed to be safe
and has achieved the minimal acceptable standard
of competence

Secondary aim
• To identify students who excel
Written Assessment
• “True or False” questions
• “Single, best option” multiple choice questions
• Multiple true or false questions
• “Short answer” open ended questions
• Essays
• “Key feature” questions
• Extended matching questions

Free response questions
• Broadly, free-response questions are commonly believed to test important higher-order reasoning skills
• Validity is high, as examinees have to generate their own responses rather than selecting from a list of options

However…
• Only a narrow range of subject matter can be assessed in a given amount of time
• They are administratively resource-intensive
• Due to their nature, their reliability is limited
Multiple choice questions

Multiple choice questions have been widely used and are popular because:
• They generally have high reliability
• They can test a wide range of themes in a relatively short period of time

However…
• They only assess knowledge of isolated facts
• By giving an option list, examinees are cued to respond and the active generation of knowledge is avoided
What are Extended Matching Questions (EMQs)?

EMQs are used as part of the undergraduate medical course
assessment programme

EMQs are used to assess factual knowledge, clinical decision
making and problem solving

They are a variant of multiple choice questions (MCQs)

EMQs are made up of 4 components: a theme, a list of answer options, a lead-in instruction and a set of case-based item stems (as in the example below)
Example of Extended Matching Question (EMQ) format
(Taken from Schuwirth & van der Vleuten, 2003)
Theme: Micro-organisms

Answer options:
A. Campylobacter jejuni
B. Candida albicans
C. Clostridium difficile
D. Clostridium perfringens
E. Escherichia coli
F. Giardia lamblia
G. Helicobacter pylori
H. Mycobacterium tuberculosis
I. Proteus mirabilis
J. Pseudomonas aeruginosa
K. Rotavirus
L. Salmonella typhi
M. Shigella flexneri
N. Tropheryma whippelii
O. Vibrio cholerae
P. Yersinia enterocolitica

Instructions: For each of the following cases, select (from the list above) the micro-organism most likely to be responsible. Each option may be used once, more than once or not at all.

1. A 48 year old man with a chronic complaint of dyspepsia suddenly develops severe abdominal pain. On physical examination there is general tenderness to palpation, with rigidity and rebound tenderness. Abdominal radiography shows free air under the diaphragm.

2. A 45 year old woman is treated with antibiotics for recurring respiratory tract infections. She develops severe abdominal pain with haemorrhagic diarrhoea. Endoscopically, a pseudomembranous colitis is seen.
Item Pools

Currently, many institutions formulate their tests from year to year by selecting items from a pre-existing pool of questions.
• Questions are pre-existing
• Time and resources are saved by employing this method

However…
• It has been widely recognised that if tests are made up of items from a pre-existing item pool, the relative difficulty of the exam paper will vary from year to year [McHarg et al (2005), Muijtjens et al (1998), McManus et al (2005)]
Item Pools

If the questions have been set, used and assessed using traditional approaches, this provides a certain amount of information about each individual question; however, there are also drawbacks to the traditional approach.

It has been recognised [Downing (2003)] that Classical Measurement Theory (CMT) has a key limitation: its item statistics are sample dependent.

Thus, the comparability of examination results from year to year is confounded by the overall difficulty of the exam and the relative ability of the examinees, rendering a direct comparison invalid.

This is particularly troublesome when we wish to
• compare different cohorts of students
• maintain a consistent level of assessment difficulty over subsequent administrations.
The problem

What is the best way to ensure that all passing students are reaching the required level of expertise?

Two forms of pass mark selection:
• Criterion referenced
• Norm referenced

Criterion referencing: a specific pass mark is designated prior to the exam as the pass/fail point.

Norm referencing: a specific proportion of the sample is designated to pass the exam, e.g. the highest scoring 75% of students will pass.
Norm referenced or Criterion Testing?

Norm-referenced testing
• Whatever the ability of the students taking the test, a fixed proportion of them will pass/fail
• The standard needed to pass the test is not known in advance
• The validity of norm referencing relies on a homogeneous sample, which may not necessarily be the case
• There is also the risk that, with a less able group of students, a student could pass the exam without reaching the desired acceptable level of clinical knowledge

Norm referencing is not appropriate
Norm referenced or Criterion Testing?

Criterion testing
• The relative difficulty of the exam could change depending on the items it contains, so a pre-determined pass mark could be easier or harder to obtain depending upon the items in the test
• Although criterion referenced tests have their own disadvantages, it has been recognised [Wass et al (2001)] that they are the only acceptable means of assessing whether a pre-defined clinical competency has been reached
Solution?
• It has been identified [Muijtjens (1998)] that a criterion referenced test could be utilised if the test were constructed by selecting items from a bank of items of known difficulty, which would then enable measurement and control of the test difficulty
• But difficulty estimates, as defined by classical test theory, are sample dependent!

Item Banking
Item banking

Item Banking is a process whereby all EMQ Items that have been
used over the past few years are ‘banked’ and calibrated onto the
same metric scale

Previously used EMQ Items

Psychometrically calibrated using Rasch Analysis

Data is linked by common items that have been used between the
exams
• “Common Item Equating”
Rasch Analysis
• When data fit the model, EMQ difficulties generalise beyond the specific conditions under which they were observed (specific objectivity)
• In other words, item difficulties are not sample dependent, as they are in Classical Test Theory
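For dichotomously scored items such as these EMQs, the model in question is the standard dichotomous Rasch model, in which the probability that student n answers item i correctly depends only on the difference between the student’s ability and the item’s difficulty, both expressed in logits:

\[
P(X_{ni} = 1) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
\]

Because ability \(\theta_n\) and difficulty \(b_i\) enter only through their difference, the comparison of any two item difficulties does not depend on which students happened to take them; this is the basis of the specific objectivity claimed above.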
Item banking

[Diagram: Items 1–4 administered separately across Terms 1–4, with no item appearing in more than one term]

• These items cannot be directly compared as there are no common links
Item banking

[Diagram: Items 1–4 across Terms 1–4 as before, plus a common Item 5 administered in every term]

• These items can be directly compared via the common link item across all terms
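As a minimal sketch of what common-item linking achieves (RUMM2020 performs the equivalent step internally when all forms are calibrated in one analysis), difficulty estimates from two separately calibrated forms can be placed on a single metric by shifting one set by the mean difficulty difference on the link items. All item names and values below are invented for illustration:

```python
# Minimal sketch of mean-shift common-item equating between two
# separately calibrated Rasch analyses. Illustrative only: RUMM2020
# achieves the same linkage by calibrating all forms together.
from statistics import mean

# Item difficulties (logits) from two hypothetical calibrations.
form_a = {"item1": -1.20, "item2": 0.35, "item3": 0.90, "item4": 1.45}
form_b = {"item3": 0.60, "item4": 1.10, "item5": -0.25, "item6": 0.80}

# Items appearing in both forms provide the link.
link_items = set(form_a) & set(form_b)

# The constant separating the two metrics is estimated as the mean
# difference in difficulty of the link items.
shift = mean(form_a[i] - form_b[i] for i in link_items)

# Re-express every Form B item on the Form A metric.
form_b_equated = {i: d + shift for i, d in form_b.items()}
print(f"shift = {shift:.3f} logits")
print(form_b_equated)
```

With more than two forms, the same idea extends through chains of shared items, which is why every form here can contribute to one connected bank.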
Item banking
• Following calibration, the set of items within the bank will define the common domain that they are measuring (in this case, medical knowledge)
• These items will therefore provide an operational definition of this unobservable (latent) trait
• When all EMQ items are calibrated onto the same scale, it becomes possible to compare the performance of students across terms, despite the fact that the EMQ exam content was made up of different items in each term
Sufficient Linkage?

What is classed as sufficient linkage between item sets? There has been some variation in the literature regarding this. Three differing viewpoints suggest that:
• linking items should be the larger of 20 items or 20% of the total number of items [Angoff (1971)]
• 5 to 10 items are sufficient to form the common link [Wright & Bell (1984)]
• one single common item could provide a valid link in co-calibrating datasets [Smith (1992)]

However, it has also been suggested [Smith (1992)] that the larger the number of common items across datasets, the greater the precision and stability of the item bank.
Potential Problems?
• Limited linkage: the data overlap between forms is reduced
• Potential misfit or DIF on the link items
Sparse Data Matrix

[Figure: sparse data matrix of items Q1–Q25 against Terms 1–8, showing which items were administered in which terms and how little the forms overlap]

Sample
• Data were collected from 550 4th year medical students over 8 terms
• All EMQ data were read into a single analysis to establish the item bank
• RUMM2020 Rasch analysis software was used
• Over the 8 terms, 6 different test forms were used (the test remained the same over the first 3 terms)
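A minimal sketch of what “read into a single analysis” means in data terms: responses from the different forms are stacked into one wide person-by-item matrix, with items a student never saw left structurally missing. The form contents and responses below are invented; RUMM2020 builds the equivalent structure from its own input files:

```python
# Sketch: assembling responses from several exam forms into one
# sparse person-by-item matrix for a single Rasch calibration.
# Form contents and responses are invented for illustration.
import numpy as np
import pandas as pd

forms = {  # which items appeared on which (hypothetical) form
    "form1": ["Q1", "Q2", "Q3", "Q4"],
    "form2": ["Q3", "Q4", "Q5", "Q6"],  # Q3, Q4 are the link items
}

rng = np.random.default_rng(0)
records = []
for form, items in forms.items():
    for person in range(3):  # a few students per form
        row = {"person": f"{form}_s{person}", "form": form}
        row.update({q: int(rng.integers(0, 2)) for q in items})
        records.append(row)

# Items a student never saw stay as NaN: structurally missing,
# not wrong -- Rasch estimation handles this missingness directly.
matrix = pd.DataFrame(records).set_index("person")
print(matrix)
```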
EMQ Item Bank

[Table: the full item bank listing, with each item identified by a Theme Code and a Question Code and mapped against the six exam forms in which it appeared]

• Approximately 25% of the items were changed from exam form to exam form
• This provides good linkage
EMQ Item Bank

[Summary fit statistics, annotated: overall fit “Pretty good!!”; item misfit “To be expected”; Person Separation Index “Low. But…”]
Low Person Separation Index
• The Person Separation Index is fairly low
• We would expect this to a certain degree, due to the highly focussed sample
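The Person Separation Index is a reliability-type coefficient; in the form usually quoted for Rasch analyses it compares the spread of the person ability estimates with their average measurement error:

\[
\mathrm{PSI} = \frac{\sigma^2_{\hat{\theta}} - \overline{\mathrm{SE}^2}}{\sigma^2_{\hat{\theta}}}
\]

A highly focused sample has a small ability variance \(\sigma^2_{\hat{\theta}}\), so the index is depressed even when the items themselves discriminate well, which is why a low value is expected here.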
Problematic Items
• Approximately 13% (26/205) of individual items were found to display some form of misfit or DIF
Misfit & DIF

What does misfit tell us?
• Misfitting items are flagged for non-inclusion and will either be amended or removed

DIF could be due to:
• Curriculum changes
• Teaching rotations
• Testwiseness
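RUMM2020 tests DIF by analysis of variance of person-item residuals across groups; as a self-contained illustration of the same idea, a Mantel-Haenszel check asks whether students from two cohorts matched on total score still differ in their odds of answering an item correctly. All counts below are invented:

```python
# Sketch of a Mantel-Haenszel DIF check for a single item between two
# cohorts (e.g. different teaching rotations), stratified by total
# score. Illustrative only: RUMM2020 itself tests DIF via ANOVA of
# person-item residuals. All data are invented.

def mantel_haenszel_chi2(strata):
    """strata: one 2x2 table [[a, b], [c, d]] per total-score stratum,
    rows = cohorts, columns = (correct, incorrect) on the item."""
    obs_sum = exp_sum = var_sum = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        if n < 2:
            continue
        row1, col1 = a + b, a + c  # cohort A size, total correct
        obs_sum += a  # observed correct in cohort A
        exp_sum += row1 * col1 / n  # expected correct in cohort A
        var_sum += (row1 * (c + d) * col1 * (b + d)) / (n * n * (n - 1))
    # Continuity-corrected MH chi-square, 1 df (> 3.84 implies p < .05)
    return (abs(obs_sum - exp_sum) - 0.5) ** 2 / var_sum

# Invented counts: cohort A outperforms cohort B at every matched
# ability level, which the statistic flags as potential DIF.
strata = [
    [[30, 10], [20, 20]],  # low scorers
    [[25, 5], [18, 12]],   # mid scorers
    [[20, 2], [15, 7]],    # high scorers
]
print(f"MH chi-square = {mantel_haenszel_chi2(strata):.2f}")  # ~11.4
```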

Does exam difficulty remain equal over different exam forms?
• Example across the item set, using Exam Form 1 as the baseline
Measurement and the pass mark

[Diagram: the latent trait drawn as a line of increasing student ability with a minimum standard marked on it; the “real” assessment is the 0–100 EMQ exam with its 60% pass mark, prompting the question of whether that pass mark corresponds to an equivalent standard on the trait]



Exam Form 1 was based on 98 Items
Pass Mark was set at 60%
60% of 98 = Raw Score pass mark of 58.8
Equivalent pass mark?

Raw score of 58.8 on Exam Form 1 = 0.561 logits
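A hedged sketch of how such a conversion works: under the Rasch model the expected raw score at ability theta is the sum of the item success probabilities (the test characteristic curve), so the logit equivalent of a raw pass mark is the theta at which that curve reaches 58.8. The item difficulties below are invented stand-ins for the calibrated bank values:

```python
# Sketch: converting a raw-score pass mark to logits by inverting the
# test characteristic curve (TCC). Item difficulties are invented;
# the real calculation uses the bank's calibrated values.
import math

def expected_score(theta, difficulties):
    """Expected raw score at ability theta under the Rasch model."""
    return sum(1 / (1 + math.exp(-(theta - b))) for b in difficulties)

def raw_to_logit(target, difficulties, lo=-6.0, hi=6.0, tol=1e-6):
    """Bisection for the theta at which the TCC equals target."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_score(mid, difficulties) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# 98 invented difficulties spread evenly across the measured range.
difficulties = [-3 + 6 * i / 97 for i in range(98)]
pass_logit = raw_to_logit(0.60 * 98, difficulties)  # 60% pass mark
print(f"A raw score of 58.8/98 corresponds to {pass_logit:.3f} logits")
```

Running the curve the other way, evaluating expected_score(0.561, ...) over each form’s own item set, is what yields the equated raw pass marks reported below.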
Equating Problems
• We had to remove 2 extreme items
• Exam Form 1 was out of 98 anyway

Exam Form   Maximum obtainable score
1           98
2           99
3           100
4           100
5           100
6           99
Equivalent standard?
• Exam Form 1 had 98 items
• 60% of 98 = 58.8
• A score of 58.8 on Exam Form 1 = 0.561 logits

Exam Form   0.561 logit equated score   Out of   Equivalent %
1           58.7                        98       59.9%
2           57.7                        99       58.3%
3           58.2                        100      58.2%
4           59.6                        100      59.6%
5           62.4                        100      62.4%
6           59.9                        99       60.5%

The spread between the hardest and easiest forms: 62.4% − 58.2% = 4.2%
Conclusion
• Item banking is a way of assessing the psychometric properties of EMQs that have been administered over different test forms
• It can identify and adapt poor questions
• It can support a comparative analysis of relative test form difficulty
• Should the pass mark be amended every term?
References
1. Miller GE. The assessment of clinical skills/competence/performance. Academic Medicine 1990; 65: S63-7.
2. McHarg J, Bradley P, Chamberlain S, Ricketts C, Searle J, McLachlan JC. Assessment of progress tests. Medical Education 2005; 39: 221-227.
3. Muijtjens AMM, Hoogenboom RJI, Verwijnen GM, Van der Vleuten CPM. Relative or absolute standards in assessing medical knowledge using progress tests. Advances in Health Sciences Education 1998; 3: 81-87.
4. McManus IC, Mollon J, Duke OL, Vale JA. Changes in standard of candidates taking the MRCP(UK) Part 1 examination, 1985 to 2002: analysis of marker questions. BMC Medicine 2005; 3 (13).
5. Downing SM. Item response theory: applications of modern test theory in medical education. Medical Education 2003; 37: 739-745.
6. Wass V, Van der Vleuten C, Shatzer J, Jones R. Assessment of clinical competence. Lancet 2001; 357(9260): 945-949.
7. Angoff WH. Scales, norming, and equivalent scores. In: Thorndike RL, editor. Educational Measurement. 2nd ed. Washington (DC): American Council on Education; 1971. p508-600.
8. Wright BD, Bell SR. Item banks: what, why, how. Journal of Educational Measurement 1984; 21(4): 331-345.
9. Smith RM. Applications of Rasch Measurement. Chicago: Mesa Press; 1992.
New Book
Smith EV Jr. & Stone GE (Eds.). Criterion Referenced Testing: Practice Analysis to Score Reporting Using Rasch Measurement Models. Maple Grove, Minnesota: JAM Press; 2009.
Contact Details
Mike Horton:
[email protected]
Matt Homer:
[email protected]
Alan Tennant: [email protected]
Website:
http://www.leeds.ac.uk/medicine/rehabmed/psychometric/
Course          Dates
Introductory    March 10-12 2010; May 12-14 2010; Sept 15-17 2010; Dec 1-3 2010; March 23-25 2011; May 18-20 2011; Sept 14-16 2011; Nov 30-Dec 2 2011
Intermediate    May 17-19 2010; Sept 20-22 2010; Dec 6-8 2010; May 23-25 2011; Sept 19-21 2011; Dec 5-7 2011
Advanced        Sept 23-24 2010; Sept 22-23 2011