2013-04-16-Chuang
Fundamentals of Assessment and
Grading
Alice CHUANG, MD
Department of Obstetrics and Gynecology
University of North Carolina-Chapel Hill
Chapel Hill, NC
AOE Basic Teaching Skills Curriculum
April 16, 12:00 PM, Bondurant G010
APGO Clerkship Directors’ School
Neither I nor my spouse has any financial interests to
disclose related to this talk.
Objectives
Understand reliability and validity
Contrast formative and summative evaluation
Compare and contrast norm-referenced and
criterion referenced assessments
Improve delivery of feedback
Understand the NBME exam
Be familiar with different testing formats, their
uses and their limitations
Terminology
Validity: Are we measuring what we think we’re
measuring?
Content: Does the instrument measure the depth and
breadth of the content of the course? Does it
inadvertently measure something else?
Construct: Do the evaluation criteria or grading
construct allow for true measurement of the
knowledge, skills or attitudes taught in the course? Is
any part of the grading construct irrelevant?
Criterion: Does the outcome correlate with true
competencies? Does it relate to important current or future
events? Is the assessment relevant to future
performance?
http://pareonline.net/getvn.asp?v=7&n=10
Examples
Validity
Content: A summative ob/gyn test which covered only
obstetrics
Construct: You allow students to use their textbook for
a knowledge-based multiple choice test of foundational
information on prenatal care.
Criterion: New Coke v. Old Coke
Terminology
Reliability: Are our measurements consistent?
The score should be the same no matter when it
was taken, who scored it, or when it was scored.
Interrater reliability: Is a student’s score consistent
between evaluators?
Intrarater reliability: Is a student’s score consistent with
the same rater even if rated under different
circumstances?
Scoring rubric: standardized method of grading to
increase interrater and intrarater reliability
http://pareonline.net/getvn.asp?v=7&n=10
Examples:
In general, if you repeat the same assessment, will
you get the same answer?
Interrater: 3 individuals are asked to go to the
beach and estimate how many seagulls they see
from 6-7AM and come up with 200, 800 and 1200.
Intrarater: A particular food critic always gives
low scores for food quality if the server is female.
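One rough way to put a number on the seagull example is the spread of the three estimates relative to their mean. The sketch below is only an illustration: the coefficient of variation is an assumed choice of statistic, not one named in the talk.

```python
from statistics import mean, stdev

# Seagull counts reported by the three observers in the example above.
estimates = [200, 800, 1200]

avg = mean(estimates)
spread = stdev(estimates)   # sample standard deviation
cv = spread / avg           # coefficient of variation (unitless)

# A large cv means the raters disagree widely relative to the size of the count.
print(f"mean={avg:.0f}, sd={spread:.0f}, cv={cv:.2f}")
```

A value near 0 would indicate that the observers largely agree; here the spread is more than half the mean, which points to poor interrater reliability.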
Examples: Show Choir Audition Rubric
Scale: Poor Candidate (0 points), Fair Candidate (1 point), Good Candidate (2 points), Superior Candidate (3 points)
Singing Skills
0 points: Sings with as much expression as a wet noodle, cannot identify which tune candidate is singing, also cannot identify what the lyrics of song are secondary to poor pronunciation
1 point: Minimally expressive, pitch off significantly on occasion, diction unclear at times
2 points: Very expressive, sings on pitch most of the time with minor errors, diction clear most of the time
3 points: Artistically expressive, sings on pitch, diction clear
Dancing Skills
0 points: Has 2 left feet, unable to learn new steps and continues to dance like MC Hammer despite different choreography demonstrated
1 point: Missteps despite multiple attempts, no artistic expression in dance moves, unable to learn new choreography after 3 demonstrations
2 points: Occasionally missteps, but overall dance steps are accurate, adapts choreography fairly rapidly
3 points: Quick and nimble, dances artistically, able to learn new choreography quickly
Enthusiasm for show CHOIR
0 points: Freely admits not knowing what GLEE is
1 point: Endorses enjoyment of GLEE, but unable to identify favorite character
2 points: Has watched 70% of GLEE episodes
3 points: Has seen every episode of GLEE, all GLEE albums confirmed in iTUNES library, has been to GLEE LIVE each summer
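Whether a rubric like this actually improves interrater reliability can be checked by having two judges score the same candidates and comparing their scores. A minimal sketch follows, assuming Cohen's kappa as the agreement statistic (not named in the talk) and using made-up scores for six hypothetical candidates on the 0-3 scale.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' categorical scores."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(rater_a)
    # Proportion of candidates on whom the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's score frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)

# Hypothetical rubric scores (0-3) from two judges for six candidates.
judge_1 = [0, 1, 2, 3, 2, 1]
judge_2 = [0, 1, 2, 3, 1, 1]
kappa = cohens_kappa(judge_1, judge_2)
print(f"kappa = {kappa:.2f}")
```

Kappa is 1 for perfect agreement and near 0 when agreement is no better than chance. The same function can be applied to one judge's scores on two occasions as a rough check of intrarater reliability.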
Formative v. summative assessments
Formative: on-going assessment, designed to help
improve educational program as well as learner
progress
Summative: designed to evaluate student overall
performance at end of educational phase and
evaluate effectiveness of teaching
http://fcit.usf.edu/assessment/basic/basica.html
Examples
Formative: short multiple choice exam written in
house that is pass/fail; answers are reviewed with
class at end of testing session
Summative: NBME exam
Formative v. summative assessments
ED30: The directors of all courses and clerkships
must design and implement a system of
formative and summative evaluation of student
achievement in each course and clerkship.
Those responsible for the evaluation of student performance
should understand the uses and limitations of various test
formats, the purposes and benefits of criterion-referenced vs.
norm-referenced grading, reliability and validity issues, formative
vs. summative assessment, etc.
Formative v. summative assessments
ED31: Each student should be evaluated early
enough during a unit of study to allow time
for remediation
ED32: Narrative descriptions of student
performance and of non-cognitive
achievement should be included as part of
evaluations in all required courses and
clerkships where teacher-student interaction
permits this form of assessment.
Formative v. summative assessments
Uses for assessments | Formative | Summative
Purpose | Feedback for learning | Certification/Grading
Breadth of scope | Narrow focus on specific objectives | Broad focus on general goals
Scoring | Explicit feedback | Overall performance
Learner affective response | Little anxiety | Moderate to high anxiety
Target audience | Learner | Society
Characteristics of feedback
Effective Feedback:
• given with the goal of improvement
timely
honest
respectful
clear
issue-specific
objective
supportive
motivating
action-oriented
solution-oriented
Destructive Feedback:
• unhelpful
accusatory
personal
judgmental
subjective
It also:
undermines the self-esteem of the receiver
leaves the issue unresolved
leaves the receiver unsure how to proceed
http://www.expressyourselftosuccess.com/the-importance-of-providing-constructive-feedback/
Feedback…from APGO/CREOG 2011
When you…
You give the impression…
I would stop…
I would recommend…instead
Norm-referenced v. criterion-referenced
assessments
Norm-referenced
Purpose is to classify students in order of achievement
from low to high
Allow comparisons of students
May not give accurate information regarding student
abilities
Half of the students should score above midpoint score
and the other half should score below midpoint score
Rickets C. A plea for the proper use of criterion-referenced tests in medical assessment.
Med Educ, Vol 43, Issue 12.
Norm-referenced v. criterion-referenced
assessments
Criterion-referenced
Purpose is to evaluate students’ knowledge and skills
compared to a pre-determined goal performance level
Gives information about a student’s achievement of
certain objectives
Should be possible for everyone to earn a passing score
Rickets C. A plea for the proper use of criterion-referenced tests in medical assessment.
Med Educ, Vol 43, Issue 12.
Example
Norm-referenced: Soccer tryouts where 11 players are
chosen out of 40
Criterion-referenced: Test for driver’s license
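The difference between the two examples can be sketched in a few lines. The names and scores below are hypothetical; the point is only that a norm-referenced decision depends on rank, while a criterion-referenced decision depends on a preset cutoff.

```python
def norm_referenced_pick(scores, slots):
    """Rank candidates and keep the top `slots`, whatever their absolute scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:slots]

def criterion_referenced_pass(scores, cutoff):
    """Pass everyone who meets the preset cutoff; all or none may pass."""
    return [name for name, score in scores.items() if score >= cutoff]

# Hypothetical candidates and scores.
tryout = {"A": 91, "B": 85, "C": 78, "D": 74, "E": 60}

team = norm_referenced_pick(tryout, slots=2)           # soccer-tryout style
licensed = criterion_referenced_pass(tryout, cutoff=70)  # driver's-test style
print(team, licensed)
```

With the norm-referenced rule, candidate C is cut despite a score of 78; with the criterion-referenced rule, C passes because the only question is whether the standard was met.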
Norm-referenced v. criterion-referenced
assessments
Be sure your assessment is appropriately norm-referenced or criterion-referenced.
Be sure that your assessment is designed with
this in mind.
Most assessments in medical education are
criterion-referenced.
Norm-referenced tests should emphasize
variability; criterion-referenced tests should
emphasize accuracy of tested material.
NBME
Exams
Developed by committees and content experts
Same protocol used to build Step 1 and Step 2
In general
Subject exams provided to all 130 LCME-accredited
medical schools in the US
8 Canadian medical schools
8 osteopathic medical schools
22 international medical schools
NBME
Scaled to have a mean of 70 and SD of 8
based on 9000 first-time test takers from 80+
schools who took exam as end-of-clerkship
exam in 1993-94
Scores do not reflect percentage of questions
answered correctly.
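To see why a scaled score is not a percent-correct score, consider a simple linear rescaling onto a reference mean of 70 and SD of 8. This is only a sketch: the talk does not describe how the NBME actually computes scaled scores, and the raw-score mean and SD used below are invented for illustration.

```python
REFERENCE_MEAN_SCALED = 70  # from the slide
REFERENCE_SD_SCALED = 8     # from the slide

def to_scaled(raw, ref_raw_mean, ref_raw_sd):
    """Map a raw score onto the 70/8 scale via a z-score (assumed linear method)."""
    z = (raw - ref_raw_mean) / ref_raw_sd
    return REFERENCE_MEAN_SCALED + REFERENCE_SD_SCALED * z

# Hypothetical reference group: raw mean of 65 items correct, SD of 10.
at_mean = to_scaled(65, ref_raw_mean=65, ref_raw_sd=10)
one_sd_above = to_scaled(75, ref_raw_mean=65, ref_raw_sd=10)
print(at_mean, one_sd_above)
```

Under these assumptions a student exactly at the reference-group mean receives a 70 regardless of how many questions that represents.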
NBME: What do those scores mean?
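Score | Total year (2011-2012) | Q1 | Q2 | Q3 | Q4
93 or above | 98 | 99 | 98 | 97 | 97
92 | 97 | 98 | 98 | 97 | 96
86 | 90 | 93 | 91 | 89 | 88
80 | 75 | 80 | 77 | 73 | 71
78 | 67 | 71 | 69 | 63 | 62
74 | 49 | 54 | 51 | 45 | 44
70 | 29 | 33 | 32 | 26 | 25
62 | 6 | 7 | 6 | 5 | 4
60 | 3 | 4 | 4 | 3 | 2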
A score of 60 in the fourth quarter means that 2% of the examinees in the
fourth quarter scored 60 or below!
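Reading the table can be expressed as a small lookup. The values below are the rows shown on the slide; because only selected scores are listed, the sketch assumes that a score falling between rows is reported at the nearest lower row, which gives a lower bound on its percentile rather than the exact figure.

```python
COLUMNS = ("total", "q1", "q2", "q3", "q4")

# score -> percentile ranks (total year, Q1, Q2, Q3, Q4), 2011-2012
TABLE = {
    93: (98, 99, 98, 97, 97),  # row labeled "93 or above"
    92: (97, 98, 98, 97, 96),
    86: (90, 93, 91, 89, 88),
    80: (75, 80, 77, 73, 71),
    78: (67, 71, 69, 63, 62),
    74: (49, 54, 51, 45, 44),
    70: (29, 33, 32, 26, 25),
    62: (6, 7, 6, 5, 4),
    60: (3, 4, 4, 3, 2),
}

def percentile_rank(score, column="total"):
    """Percentile for the highest listed score <= `score`; None if below all rows."""
    idx = COLUMNS.index(column)
    eligible = [s for s in TABLE if s <= score]
    if not eligible:
        return None
    return TABLE[max(eligible)][idx]

print(percentile_rank(60, "q4"))  # the example quoted on the slide
print(percentile_rank(74, "q1"), percentile_rank(74, "q4"))
```

The same scaled score of 74 sits at a noticeably higher percentile early in the year than late in the year, which matters when a clerkship sets quarter-independent grade cutoffs.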
NBME: Academic purpose for exam
Academic purpose | %
Advanced placement | 5
Course/clerkship | 95
Year-end | 12
Make-up | 21
Minimal competence | 44
Identify at risk students | 23
Practice for USMLE | 47
Promotion requirement | 37
Review course | 1
Student self-assessment | 26
Other | 4
Total responses | 78
NBME: Weight given the subject exam
Weight given the subject exam | %
1-10% | 4
11-20% | 16
21-30% | 33
31-40% | 39
41-50% | 13
>50% | 0
Total number responding | 70
NBME 2008 Clerkship Survey Results
Assessment/Evaluation Method | Ob/gyn (%)
Computer Case Simulations | 0.5
Subject Exam | 30
School’s MCQ Exam | 9
Observation and evaluation by residents | 28
Observation and evaluation by faculty | 26
Oral exam | 14
OSCE | 12
Peer evaluation | 1
Standardized patient exam | 3
Other | 18
Total number responding | 81
NBME
2004 and 2009 survey of performance guidelines
across clerkships
Recommend setting an absolute versus a relative
standard for performance
Angoff procedure: item-based; judges estimate the proportion of
minimally proficient examinees who would answer each
question correctly
Hofstee method: judges determine minimum and
maximum passing scores and minimum and maximum acceptable
failure percentages; these are then plotted against a graph of exam
score and failure rate
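Both standard-setting methods reduce to short calculations once the judges' inputs are collected. The sketch below is a simplified illustration with invented judge ratings and exam scores; the function names are hypothetical, and the Hofstee version scans whole-number cut scores rather than solving for the exact intersection on a graph.

```python
from statistics import mean

def angoff_cut_score(ratings):
    """ratings[j][i]: judge j's estimate that a minimally proficient
    examinee answers item i correctly. Returns the expected number correct."""
    per_judge_totals = [sum(judge) for judge in ratings]
    return mean(per_judge_totals)

def hofstee_cut_score(scores, k_min, k_max, f_min, f_max):
    """Pick the cut score between k_min and k_max where the observed fail
    rate is closest to the judges' line from (k_min, f_max) to (k_max, f_min)."""
    def fail_rate(cut):
        return sum(s < cut for s in scores) / len(scores)

    def judges_line(cut):
        frac = (cut - k_min) / (k_max - k_min)
        return f_max - frac * (f_max - f_min)

    candidates = range(k_min, k_max + 1)
    return min(candidates, key=lambda c: abs(fail_rate(c) - judges_line(c)))

# Hypothetical Angoff ratings: 3 judges, 4 items.
ratings = [
    [0.6, 0.7, 0.5, 0.8],
    [0.5, 0.6, 0.6, 0.8],
    [0.7, 0.7, 0.5, 0.8],
]
angoff = angoff_cut_score(ratings)  # cut score in items correct (out of 4)

# Hypothetical exam scores 50..99; judges accept cut scores of 55-75
# and failure rates of 0-50%.
exam_scores = list(range(50, 100))
hofstee = hofstee_cut_score(exam_scores, k_min=55, k_max=75, f_min=0.0, f_max=0.5)
print(angoff, hofstee)
```

Angoff yields an absolute standard fixed before scores are seen, while Hofstee is a compromise that also looks at how many examinees a given cut score would actually fail.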
NBME
Testing Formats
Multiple choice exam (MCQ)
Objective structured clinical examination
(OSCE)
Oral examination
Direct observation
Simulation
Standardized patient
Patient/procedure log
Medical record reviews
Written essay questions
Casey et al, To the point: reviews in medical education – the Objective Structured Clinical Examination. AJOG, Jan 2009.
Testing format: MCQ
Use distractors which could plausibly represent
correct answer
Use a question format, not complete-the-statement
format
Emphasize higher-level thinking, not strict
memorization
Keep option length consistent within a question
Balance the placement of the correct answer
Use correct grammar
Avoid clues to the correct answer
Highly reliable and valid for assessing knowledge
http://testing.byu.edu/info/handbooks/14%20Rules%20for%20Writing%20Multiple-Choice%20Questions.pdf
Testing format: OSCE
Examinees rotate through circuit of stations (5-10 minutes
each)
One-on-one examination (with examiner or trained or
simulated patient)
List of criteria for successful completion of each station
Each station tests a specific skill or competency
Good for examining higher-order skills, clinical and
technical skills
Requires large amount of resources
Testing format: Oral Exam
Portfolio based: similar to case-based portion of Oral
Boards
Poor inter-rater and intra-rater reliability
Scores are higher when scored live versus on video
Teaching students how to do better on oral exam does not
improve scores
Practicing oral exams does improve scores
Mock public oral exam improves performance
Limitations
Halo effect (grade reflects not only performance on exam but
also previous experience)
Subconscious consensus grading: examiners take subconscious
cues from each other.
Burch & Seggie, 2008; Kearney et al, 2002; Burchard et al, 2007; Jacobsohn et al, 2006
Testing format: Oral Exam
Is an oral exam justified? Is there an advantage?
Does the material lend itself to open questioning?
How will communication skills, delivery of information be
graded? Will only content be graded?
Is the examiner experienced? Will he/she skew grades in
any way?
How will you prepare students for the exam?
Is there enough time to examine every student
adequately?
How much prompting/assistance is allowed for oral
examination? How much time will you allow for “thinking?”
How will you ensure consistency in these areas for all
examinees?
http://testing.byu.edu/info/handbooks/14%20Rules%20for%20Writing%20Multiple-Choice%20Questions.pdf
Testing format: Direct observation
Formalized criteria
Various observers
True-to-life clinical setting (versus simulated)
Numerical scores
Comment anchored
Improve reliability with multiple perspectives
Consider 360 evaluation (including self, patient and
other staff members)
Testing format | MCQ | OSCE | Direct obs | Oral exam
Content | +++ | ++ | + | +
Construct | +++ | ++ | + | +
Criterion | + | ++ | + | +
Reliability | +++ | ++ | + | +
Formative | Y | Y | Y | Y
Summative | Y | Y | Y | Y
Norm-referenced | Y | N | N | N
Criterion-referenced | Y | Y | Y | Y
General rules of thumb
Be sure your assessment
Provides reliable data
Provides valid data
Provides valuable data
Is feasible
Can be incorporated into the systems in place
(hospital, clinic, curriculum, etc)
Is consistent with course objectives
Utilizes multiple instruments, multiple assessors and
multiple points of assessment
Aligns with pre-specified criteria
Is fair
Lynch and Swing. Key Considerations for Selecting Assessment Instruments and
Implementing Assessment Systems. ACGME.
References
Bond, Linda A. (1996). Norm- and criterion-referenced testing. Practical Assessment, Research &
Evaluation, 5(2). Accessed at http://pareonline.net/getvn.asp?v=5&n=2
Burch VC, Seggie JL. Use of a structured interview to assess portfolio-based learning. Med Ed 2008:
42: 894-900.
Burchard K et al. Is it live or is it Memorex? Student oral examinations and the use of video for
additional scoring. Am J Surg. 193 (2007), 233-236
Casey et al, To the point: reviews in medical education – the Objective Structured Clinical
Examination. AJOG, Jan 2009.
Jacobsohn E , Kock PA, Avidan M. Poor inter-rater reliability on mock anesthesia oral examinations.
Kearney RA et al. The inter-rater and intra-rater reliability of a new Canadian oral examination
format in anesthesia is fair to good. Can J Anesth 2002; 49:3, 232-236.
Lynch and Swing. Key Considerations for Selecting Assessment Instruments and Implementing
Assessment Systems. ACGME.
Metheny WP, Espey EL, Bienstock J, et al. To the point: Medical education reviews evaluation in
context: Assessing learners, teachers, and training programs. Am J Obstet Gynecol.
2005;192(1):34-37.
Moskal, Barbara M. & Jon A. Leydens (2000). Scoring rubric development: validity and reliability.
Practical Assessment, Research & Evaluation, 7(10). Retrieved December 29, 2009 from
http://PAREonline.net/getvn.asp?v=7&n=10
Rickets C. A plea for the proper use of criterion-referenced tests in medical assessment. Med Educ,
Vol 43, Issue 12.
14 Rules for Writing Multiple Choice Questions. Brigham Young University 2001 Annual Conference.
Accessed at http://testing.byu.edu/info/handbooks/14%20Rules%20for%20Writing%20Multiple-Choice%20Questions.pdf
Formative vs. Summative Assessments. Classroom Assessment. Accessed at:
http://fcit.usf.edu/assessment/basic/basica.html
NBME 2008 Clinical Clerkship Director Survey Results. Accessed at
https://portal.nbme.org/web/medschools/home?p_p_id=62_INSTANCE_dOGM&p_p_action=0&p_p_state=maximized&p_p_mode=view&p_p_col_id=column1&p_p_col_count=1&_62_INSTANCE_dOGM_struts_action=%2Fjournal_articles%2Fview&_62_INSTANCE_dOGM_keywords=&_62_INSTANCE_dOGM_advancedSearch=false&_62_INSTANCE_dOGM_andOperator=true&_62_INSTANCE_dOGM_groupId=1172&_62_INSTANCE_dOGM_searchArticleId=&_62_INSTANCE_dOGM_version=1.0&_62_INSTANCE_dOGM_name=&_62_INSTANCE_dOGM_description=&_62_INSTANCE_dOGM_content=&_62_INSTANCE_dOGM_type=&_62_INSTANCE_dOGM_structureId=&_62_INSTANCE_dOGM_templateId=&_62_INSTANCE_dOGM_status=approved&_62_INSTANCE_dOGM_articleId=817480
Objective Structured Clinical Examination. Wikipedia. Accessed at
http://en.wikipedia.org/wiki/Objective_structured_clinical_examination
Reliability and Validity. Classroom Assessment. Accessed at:
http://fcit.usf.edu/assessment/basic/basicc.html
Talk about teaching: Significant issues in Oral Examinations. Contributed by Meryl Carlson,
Concordia College, Moorhead, MN. Accessed at
http://www.cord.edu/faculty/ulnessd/oral/MCarlson/questions.html