Transcript Slide 1

Using Rasch modeling to investigate the psychometric properties of the OSCE
Carole Steketee, Michele Gawlinski & Elina Tor
Medical Education Support Unit, School of Medicine Fremantle, Notre Dame University Australia
Aim
• To present a prototype of a validated psychometric evaluation of an OSCE that is grounded in Rasch unidimensional measurement, a unified theoretical conception of measurement.
Introduction
The Objective Structured Clinical Examination (OSCE) is an assessment tool for clinical skills and competence that has been widely used in the health sciences, particularly in medical education.
The goal of an OSCE is to make reproducible pass/fail decisions and to position candidates according to their demonstrated abilities. The instruments used, namely the OSCE stations, must measure candidate ability consistently; in other words, there should be invariant comparison and measurement.
Background
OSCE Design
• 4th year Postgraduate MBBS Students (N=80)
• 11 stations (@ 20 minutes) - 10 clinical stations and 1 station on Personal & Professional
Development - conducted in 2 sessions
• Clinical stations based on medical disciplines/surgical specialties
• Real and simulated patients
• 50 examiners (one examiner for each station)
• Summed scores across stations used as the overall OSCE score for students
Examiner severity was examined using raw scores, by comparing the mean rating given by one examiner with the mean rating given by all other examiners to the same group of students in all other stations. If the difference in mean ratings is > +2 SEM (significant leniency) or < -2 SEM (significant stringency), the ratings by that examiner are adjusted by linear equating, taking the average ratings of the other examiners in all other stations as the reference.
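As a minimal sketch of this rule (in Python, with illustrative names not taken from the authors' materials), a mean shift is one simple form of linear equating; the exact adjustment used in the study may differ:

```python
import numpy as np

def adjust_examiner_ratings(ratings, reference_mean, sem):
    """Flag and adjust one examiner's ratings against the reference.

    ratings        -- this examiner's raw ratings for a group of students
    reference_mean -- mean rating the same students received from all
                      other examiners in all other stations
    sem            -- standard error of measurement used for flagging
    """
    ratings = np.asarray(ratings, dtype=float)
    diff = ratings.mean() - reference_mean
    if diff > 2 * sem or diff < -2 * sem:   # significant leniency/stringency
        return ratings - diff               # shift to the reference mean
    return ratings                          # within tolerance: leave as-is
```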
Graphical Evidence of Item Fit
Local Independence between Stations – evidence of construct validity
Uniqueness of the Rasch Model
• The ideal standards of construct validity (invariant comparison, unidimensionality, sufficiency)
embedded in a mathematical formula;
• Separation of parameter estimates through conditional probability;
• Examinee-free item difficulty estimation & item-free examinee ability estimation;
Data Analysis
• Each station (max mark 20) analysed as one item
• Raw scores for each station collapsed into 10 categories (0 to 9) to be fitted to the PRM (using RUMM2020); a sketch of this collapsing appears after this list
• The observed rating pattern compared with the expected pattern of ratings predicted by the PRM
• Concurrent examination of data fit to the model at both the individual item and overall test level
• Misfit (test, item or examinee) indicates an anomaly in the rating pattern and warrants further qualitative investigation of the conceptualisation of the construct, item quality, the physical conditions of the examination, or an examinee's unique circumstances that might impact on the ratings
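The collapsing step can be sketched as follows. An even binning of the 0-20 marks into ten ordered categories is assumed here; the actual recoding scheme may have been guided by category frequencies:

```python
import numpy as np

def collapse_to_categories(raw_marks, max_mark=20, n_cats=10):
    """Recode raw station marks (0..max_mark) into n_cats ordered
    categories (0..n_cats-1) before fitting the PRM."""
    raw = np.asarray(raw_marks, dtype=float)
    cats = np.floor(raw / (max_mark + 1) * n_cats).astype(int)
    return np.clip(cats, 0, n_cats - 1)

# e.g. collapse_to_categories([0, 7, 13, 20]) -> array([0, 3, 6, 9])
```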
• When the data fit the Rasch model, the program transforms the ordinal raw scores into a linear interval scale measured in units of logits
• An estimate of the modeled error variance accompanies each estimate of examinee ability – a quantification of precision that describes the range (confidence interval) within which each person's true ability falls
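Both points amount to simple arithmetic once the Rasch program has produced an ability estimate and its standard error; a sketch with illustrative values (not study data):

```python
import numpy as np

def logit(p):
    """Transform an ordinal proportion into a linear logit measure."""
    return np.log(p / (1 - p))

ability = 0.85    # person ability in logits (illustrative value)
sem = 0.32        # modeled standard error of that estimate (illustrative)

# ~95% confidence interval within which the person's true ability falls
low, high = ability - 1.96 * sem, ability + 1.96 * sem
print(f"true ability in [{low:.2f}, {high:.2f}] logits")
```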
Overall Test of Unidimensionality
The Polytomous Rasch Model (PRM)
In its simplest form, when a candidate is rated for performance in a task, the log odds of a candidate being
rated in category x is modeled in the PRM as a function of the candidate’s ability and the task difficulty:
$\log\!\left[\frac{P(X_{vm} = x)}{P(X_{vm} = x - 1)}\right] = \beta_v - \delta_m$

where:
$P(X_{vm} = x)$ is the probability of examinee $v$ being rated in category $x$ on task $m$
$P(X_{vm} = x - 1)$ is the probability of examinee $v$ being rated in category $x - 1$
$\beta_v$ is the ability of examinee $v$
$\delta_m$ is the difficulty of task $m$
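Summing these adjacent log-odds gives the probability of each rating category directly. A minimal Python sketch of the simplest PRM written above (function names are illustrative):

```python
import numpy as np

def prm_category_probs(beta, delta, n_cats):
    """Category probabilities under log[P(X=x)/P(X=x-1)] = beta - delta.

    beta   -- examinee ability (logits)
    delta  -- task difficulty (logits)
    n_cats -- number of rating categories (e.g. 10 for categories 0..9)
    """
    x = np.arange(n_cats)                  # categories 0, 1, ..., n_cats-1
    logits = x * (beta - delta)            # cumulative adjacent log-odds
    probs = np.exp(logits - logits.max())  # subtract max for stability
    return probs / probs.sum()             # normalise to sum to 1

# e.g. prm_category_probs(beta=1.0, delta=-0.5, n_cats=10) puts most
# probability mass on the high categories: an able examinee, an easy task.
```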
The Multi-Facets Rasch Model (MFRM)
The MFRM is an extension of the PRM which can be applied to partition the variance in ratings of an examinee's performance into facets such as examinee ability ($\beta_v$), task difficulty ($\delta_m$) and examiner severity ($S_j$):
$\log\!\left[\frac{P(X_{vm} = x)}{P(X_{vm} = x - 1)}\right] = \beta_v - \delta_m - S_j$
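Under this formulation examiner severity acts as an additional shift in effective task difficulty, so the PRM sketch above extends directly (again with illustrative names):

```python
import numpy as np

def mfrm_category_probs(beta, delta, severity, n_cats):
    """MFRM sketch: examiner severity S_j shifts the effective task
    difficulty, so a severe examiner (positive S_j) makes the task harder
    and a lenient one (negative S_j) makes it easier."""
    x = np.arange(n_cats)
    logits = x * (beta - delta - severity)  # adjacent log-odds: beta - delta - S_j
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()
```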
• The Item-Trait Interaction fit statistics evaluate the suitability of the data for the construction of a variable (clinical competence) and its measures (Wright & Masters, 1982; Wright & Stone, 1979).
• It is a formal test of unidimensionality for the clinical tasks in all 11 stations, and of the validity of the summed scores as the measure of an examinee's clinical competence.
A non-significant χ2 probability for the test of fit of the data to the model (χ2 = 17.06, df = 22, p = 0.76) indicates that the 11 stations in the OSCE exam map onto a common underlying latent construct: clinical competence. It is therefore justified to take the summed score across stations as an indicator of an examinee's level of clinical competence.
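The reported probability can be checked from the χ2 statistic and its degrees of freedom, for example with SciPy:

```python
from scipy.stats import chi2

# Item-trait interaction test reported above: chi2 = 17.06 on df = 22.
p_value = chi2.sf(17.06, df=22)   # survival function = upper-tail probability
print(round(p_value, 2))          # ~0.76 -> non-significant: data fit the model
```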
Individual Item Fit & Item Difficulty Estimates
Task difficulty estimates for each station (Station 9 – least challenging tasks)
Limitation
The design of the OSCE, in which each student was rated by one examiner in each station and a different set of examiners was used for each station, did not meet the data collection design required for a Rasch analysis based on the MFRM to account for examiner severity.
Delimitation
• Data were fitted to the PRM to investigate the validity of the summed score and to make sure the individual station data fit the PRM before taking the summed station scores as the overall OSCE score (Objective 1).
• When the data fit the model, the total raw score across stations is the sufficient statistic that contains all the information about the examinee's ability and station difficulty.
• Investigation of examiner severity and adjustment of scores were therefore carried out on the raw scores at this stage (Objective 2).
Suggestion for Assessment Practice in SoM
• Rasch modeling is a practical quality control and quality assurance tool for OSCE examinations, as described above – complementary to classical test theory (CTT)
• Establishing an item bank for the OSCE which includes the psychometric properties of individual stations/items, based on the linear measures of item/task difficulty, to enable linking of OSCE stations across medical/surgical disciplines and across levels of training through the co-calibration of test items or test linking/equating
• Standard setting of the OSCE based on Rasch measurement
• The integration of Rasch modeling into scaling and item analysis for all assessment components: written exams (MCQ, EMQ, SAQ) and performance assessments such as the OSCE, Mini-CEX, Professional Portfolio, Clinical Audit, etc.
Figure 1 Summary Test of Fit Statistics for the Overall OSCE Exam
Targeting & Item Map
• The difficulty of items and the distribution of examinee ability are represented visually on an item-person map, or targeting map
• Examinees and test items are placed on the same scale – testing examinee ability in relation to the tasks, not to other examinees
Figure 5
References
Andrich, D., Lyne, A., Sheridan, B., & Luo, G. (2006). RUMM 2020. Perth: RUMM Laboratory.
Fisher, W. P. Jr. (2001). Invariant thinking vs. invariant measurement. Rasch Measurement Transactions, 14(4), 778-781.
Linacre, J. M. (2009). A user's guide to Facets Rasch measurement computer program, version 3.66.0. Chicago: Winsteps.com.
Schumacker, R. E., & Smith, E. V. (2007). A Rasch perspective. Educational and Psychological Measurement, 67(3), 394-409.
Wright, B. D., & Masters, G. N. (1982). Rating Scale Analysis. Chicago: MESA Press.
Wright, B. D., & Stone, M. H. (1979). Best Test Design. Chicago: MESA Press.
Figure 2 Individual items fit to the PRM (by item difficulty order)
The χ2 fit statistics in the last column of Figure 2 show the statistical evidence of data fit to the Rasch model at the individual station level: each station consistently (invariantly) separates the examinees in terms of their clinical competence.
Establishing one common scale (one "ruler") would link all the different test forms/formats (horizontal test linking). The same scale could also be used to link assessment data across different stages/years of training (vertical test linking).
A path towards the realization of the vision for competence-based medical education – an arduous but not insurmountable task!
The distribution of examinees on the continuum of clinical competence is skewed to the left compared with the distribution of the categories of task/item difficulty – as expected for an OSCE
Station 10 – most challenging tasks
Further evidence of unidimensionality
Rasch modeling provides a formal test of invariant comparisons across items in a test, and therefore of the unidimensionality of the latent construct across the multiple stations in an OSCE examination. It provides evidence for the validity of the summed scores for the overall OSCE examination.
Figure 4 Individual Person Fit and Location Estimates (Excerpt) – by Location Order
Methods
Correlations based on factors other than clinical competence
Conclusion
Individual Examinee Fit & Clinical Competence Estimations
Results (Rasch analysis)
Low residual correlations between items
Figure 7 Residual Correlation Matrix
Objective
• To justify the validity of taking the summed score across stations as the overall OSCE score;
• To investigate and account for examiner leniency/stringency in the OSCE scores.
      I0001  I0002  I0003  I0004  I0005  I0006  I0007  I0008  I0009  I0010  I0011
I0001  1.000
I0002  0.024  1.000
I0003 -0.115 -0.068  1.000
I0004  0.044 -0.016 -0.228  1.000
I0005  0.013  0.040 -0.267 -0.069  1.000
I0006 -0.124 -0.048 -0.089 -0.062 -0.052  1.000
I0007 -0.055 -0.017 -0.016 -0.179  0.123 -0.198  1.000
I0008  0.008 -0.176  0.064 -0.183 -0.249 -0.061 -0.432  1.000
I0009 -0.002 -0.186 -0.250  0.031  0.186  0.150 -0.189 -0.110  1.000
I0010 -0.361 -0.141 -0.102 -0.078 -0.197 -0.142 -0.160  0.054 -0.234  1.000
I0011 -0.281 -0.303  0.010 -0.076 -0.283 -0.182 -0.043  0.023 -0.043  0.011  1.000
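A matrix like Figure 7 is obtained by correlating the standardized person-item residuals (observed ratings minus Rasch-expected ratings) across stations; a sketch assuming such a residual matrix is available from the Rasch fit:

```python
import numpy as np

def residual_correlations(std_residuals):
    """std_residuals: (n_examinees, n_stations) matrix of standardized
    residuals from the Rasch fit. Low or negative off-diagonal
    correlations, as in Figure 7, support local independence between
    stations."""
    return np.corrcoef(std_residuals, rowvar=False)
```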
Figure 6 Item Map