Agenda - Mountain Measurement

Standard Setting for Professional Certification

Brian D. Bontempo

Mountain Measurement, Inc.

[email protected]

(503) 284-1288 ext 129

Overview

• Definition of Standard Setting • Management Issues relating to Standard Setting • Standard Setting Process • Methods of Standard Setting • Using multiple methods of Standard Setting

Definition of Standard Setting

• Standard setting is a process whereby decision makers render judgments about the performance level required of minimally competent examinees

Types of Standards

• Relative Standard (Normative Standards) – Top 70% of scores pass – 20 points above average • Criterion-Referenced Standard (Absolute Standards) – 70% of the items correct – 600 out of 800 scaled score – .05 logits – 20 items correct

Why do we conduct Standard Setting?

• To objectively involve stakeholders in the test decision making process • To connect the expectations of employers to the test decision making process • To connect the reality of training to the test decision making process • To ensure psychometric soundness & legal defensibility

When to (re)set a passing standard

• For a new exam, after Beta Test data have been analyzed, typically after “Live” Test Forms have been constructed • For exam revisions, when the expectations of a job role have changed – Practice has changed – Content domain has changed – It is not appropriate to change the passing standard whenever a test or training has been revised – It is not appropriate to change the passing standard because of supply and demand issues (too many/few certified professionals)

Who should lead a standard setting panel?

• An experienced Psychometrician – Insider perspective, familiar with your certification and exam development – Outsider perspective, not familiar with your certification and exam development

How rigid should you be in your direction to the Psychometrician?

• I recommend a conversation between the Psychometrician and the Test Sponsor to figure out what works best. Typically a test sponsor will specify a framework (e.g., Angoff) and let the Psychometrician dictate the specifics.

Outcomes of Standard Setting

• A conceptual (qualitative) definition of minimal competency • A proposed numeric (quantitative) passing standard • A set of alternate passing standards based on errors in the process • Expected passing rate(s) from each standard • A report documenting the process and the psychometric quality of the process

Standard Setting Process

Standard Setting Process

• Gather test data • Assemble a group of judges – Define minimal competency – Train judges on the method – Render judgments on the performance of borderline examinees • Calculate the passing standard by aggregating the judgments • Evaluate the outcome by calculating the expected passing rate
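The last step, calculating the expected passing rate, can be sketched as follows. This is a minimal illustration, assuming examinee scores are raw number-correct scores and that an examinee passes when the score is at or above the cut; the function name and the sample scores are hypothetical.

```python
def expected_passing_rate(scores, cut_score):
    """Return the fraction of examinees scoring at or above the cut score."""
    if not scores:
        raise ValueError("scores must not be empty")
    passed = sum(1 for s in scores if s >= cut_score)
    return passed / len(scores)


# Hypothetical Beta Test scores, for illustration only
beta_scores = [12, 15, 18, 20, 21, 23, 25, 27, 28, 30]
rate = expected_passing_rate(beta_scores, cut_score=20)
print(f"Expected passing rate: {rate:.0%}")
```

Running the same function against each alternate passing standard gives the set of expected passing rates that the panel report would document.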

Selecting your judges

• Representative Sample – Hiring Managers – Trainers – Entry-Level Practitioners • How many judges is enough?

– For a low stakes exam • at least 8 judges – For a medium stakes exam • at least 12 judges – For a high stakes exam • at least 16 judges

Developing a Definition of Minimal Competency

• Identify 3 common tasks within each domain of the test blueprint (an easy, a hard, and a “Borderline” task) • Characterize the performance of minimally competent examinees on each of the major tasks • Write text that summarizes these discussions

Training Judges

• Instruct them on their task • Practice rating items – Two sets of practice items • Practice discussing items • Explain the stats that you will be providing them • Set the tone and boundaries for good ‘group psychology’

Standard Setting Methods

Types of Standard Setting Methods

• Examinee-Centered Methods – Judges use external criteria, such as on the job performance, to evaluate the competency of real examinees • Test-Centered Methods – Judges evaluate the performance of imaginary examinees on real test items • Adjustments – in order to account for inaccuracy in the standard setting process, Psychometricians use real test data to provide a range of probable values for the passing standard

Examinee-Centered Methods

• Borderline group – Using external criteria (such as performance on the job), judges identify a group of examinees that they think are borderline examinees. The average score of this group is the passing standard • Contrasting groups – Using external criteria, judges classify examinees as passers or failers. The passing standard is established by determining the point which discriminates the best between the scores of both groups
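Both examinee-centered calculations can be sketched in a few lines. This is a rough illustration, not a definitive procedure: it assumes that "discriminates the best" means the cut score that misclassifies the fewest examinees (other criteria, such as the intersection of two smoothed distributions, are also used), and the function names and scores are hypothetical.

```python
from statistics import mean


def borderline_group_cut(borderline_scores):
    """Borderline group: the average score of the borderline examinees."""
    return mean(borderline_scores)


def contrasting_groups_cut(competent_scores, not_competent_scores):
    """Contrasting groups: the cut score with the fewest misclassifications.

    Assumption: an examinee passes when score >= cut. Ties go to the
    lowest candidate cut score.
    """
    candidates = sorted(set(competent_scores) | set(not_competent_scores))
    best_cut, best_errors = None, None
    for cut in candidates:
        false_fail = sum(1 for s in competent_scores if s < cut)
        false_pass = sum(1 for s in not_competent_scores if s >= cut)
        errors = false_fail + false_pass
        if best_errors is None or errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut


# Hypothetical judge classifications, for illustration only
borderline = [19, 21, 22, 20, 23]
competent = [22, 24, 25, 27, 30]
not_competent = [14, 17, 19, 21, 23]

print(borderline_group_cut(borderline))
print(contrasting_groups_cut(competent, not_competent))
```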

Test-Centered

• Modified-Angoff – Angoff, W.H. (1971). Scales, norms, and equivalent scores. In R.L. Thorndike (Editor), Educational Measurement, 2nd edition: Washington, DC American Council on Education.

• Bookmark – Mitzel, H.C., Lewis, D.M., Patz, R.J., & Green, D.R. (2001). The Bookmark Procedure: Psychological perspectives. In G.J. Cizek (Editor), Setting Performance Standards: Mahwah, NJ Lawrence Erlbaum Associates.

Basic Angoff Process

• Judges evaluate each item – What percentage of minimally competent (MC) examinees would get the item correct? • Feedback/Discussion • Judges make adjustments to their ratings • The average across all items is each judge’s passing standard • The average of all judges’ standards is the passing standard
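The two averaging steps in the Angoff process can be sketched as below. This is a minimal example, assuming each judge's final ratings are stored as proportions between 0 and 1, one per item; the function name, judge labels, and ratings are hypothetical.

```python
from statistics import mean


def angoff_standard(ratings):
    """Aggregate Angoff ratings.

    ratings maps each judge to a list of final ratings (proportion of MC
    examinees expected to answer each item correctly).
    Returns (per-judge standards, panel standard), both as proportions.
    """
    judge_standards = {judge: mean(items) for judge, items in ratings.items()}
    panel_standard = mean(list(judge_standards.values()))
    return judge_standards, panel_standard


# Hypothetical final ratings for a 4-item test, for illustration only
ratings = {
    "Judge A": [0.9, 0.6, 0.7, 0.8],
    "Judge B": [0.8, 0.5, 0.7, 0.6],
    "Judge C": [0.7, 0.6, 0.6, 0.7],
}
judge_standards, panel_standard = angoff_standard(ratings)
raw_cut = panel_standard * 4  # convert the proportion to a raw score
print(f"Panel standard: {panel_standard:.3f} ({raw_cut:.2f} of 4 items)")
```

The spread of the per-judge standards is also worth reporting, since it feeds the alternate passing standards described under Outcomes.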

Common Angoff Issues

• What percentage of MC candidates would get the item correct? – “MCs” vs. “all” candidates – “MCs” is correct – “would” vs. “should” – “would” is correct

Common Angoff Issues

• What type of ratings should judges make?

– 1/0 (Yes/No) – Percentage of Borderline examinees • Round to 1 decimal (.9) • Round to 2 decimals (.92) – NEVER use percentage of all examinees

Common Angoff Issues

• Types of Feedback to provide – Group Discussion • Relate to conceptual definition of minimal competency – Typical or atypical content – Relevancy • Relate to item nuances – Item Stem – Item Distractors • “I expect a lot of the MC because this is core content and the item is straightforward.” • “I would like to cut the MC some slack because this is not covered well in training and the scenario is a little abstract.”

Common Angoff Issues

• Types of Feedback to provide – Empirical Data • Answer Key – Yes! • Percentage of Borderline examinees answering the item correctly – If possible, yes • P-Value (Percentage of examinees answering the item correctly) – Only if the percentage of Borderline examinees is not available

Common Angoff Issues

• When to provide feedback?

– Initial Rating – Discuss items – Secondary Rating – Provide Empirical Data – Tertiary Rating

Bookmark

• Test is divided up into subtests – By domain OR – Equal variance of difficulty across subtests • Items are sorted from easiest to hardest – By judges OR – By actual value • Judges bookmark the subtest at the point where the MC examinee would stop getting items correct and start getting them incorrect – The lowest possible standard – The expected standard – The highest possible standard • Judges discuss ratings & make adjustments • Passing standard is the average number of items answered correctly

Common Bookmark Issues

• How many Ordered Item Booklets (OIB) – One for each content domain – An equivalent number that meet the test plan

Common Bookmark Issues

• How should I select Items for the OIB?

– Minimize the distance in difficulty between any two adjacent items.

• Ensure that there are enough items at all difficulty levels for each OIB • Ensure that the variance in item difficulty is the same for each OIB

Common Bookmark Issues

• How should I sort the item booklets?

– Easiest to Hardest – Hardest to Easiest

Common Bookmark Issues

• How do I know when the MC would stop getting items correct and start getting them incorrect? (What is the appropriate RP value?) – .5 – .67 (most common) – .75

Common Bookmark Issues

• How do I convert the bookmark to a passing standard?

– Previous Item (PI) – Take the difficulty of the easier of the two items on either side of the bookmark – Between Item (BI) – Take the average of difficulty of the two items
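The PI and BI rules can be sketched as follows. This is an illustrative sketch only: it assumes item difficulties are already on a common scale (for example, logits) and sorted easiest to hardest, that the bookmark is recorded as the position of the first item the MC examinee is expected to miss, and it ignores any further adjustment for the chosen RP value. The function name and difficulties are hypothetical.

```python
def bookmark_cut(ordered_difficulties, bookmark_index, rule="BI"):
    """Convert a bookmark placement to a cut on the difficulty scale.

    ordered_difficulties: item difficulties sorted easiest to hardest.
    bookmark_index: 0-based index of the first item after the bookmark.
    rule: "PI" uses the item just before the bookmark (the easier one);
          "BI" averages the two items on either side of the bookmark.
    """
    if not 0 < bookmark_index < len(ordered_difficulties):
        raise ValueError("bookmark must fall between two items")
    previous_item = ordered_difficulties[bookmark_index - 1]
    next_item = ordered_difficulties[bookmark_index]
    if rule == "PI":
        return previous_item
    if rule == "BI":
        return (previous_item + next_item) / 2
    raise ValueError("rule must be 'PI' or 'BI'")


# Hypothetical ordered item booklet difficulties (logits), for illustration
oib = [-1.5, -0.8, -0.2, 0.4, 1.1]
print(bookmark_cut(oib, bookmark_index=3, rule="PI"))
print(bookmark_cut(oib, bookmark_index=3, rule="BI"))
```

Each judge's cut would be computed this way and then averaged across judges and booklets.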

Compare Angoff and Bookmark

• Angoff requires less preparation – Select a real test form as opposed to building the OIBs • Judges understand Bookmark better – Rating the difficulty of an item is a difficult task • Bookmark requires more test items – I’d recommend an item pool of at least 40 solid test items per content domain

Other Test Centered Methods

• Ebel • Nedelsky • Jaeger • Rasch Item Mapping

Ebel

• Judges sort each item into piles – How difficult is this item for the MC examinee? • Easy, moderate, or hard – How relevant is this content for practice? • Critical, Moderately important, Not relevant • Judges then estimate the percentage of items in each cell that MC examinees would get correct • The passing standard is then determined by multiplying the number of items in each cell by the percentage and summing all values
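The Ebel calculation is a weighted sum over the difficulty-by-relevance cells. A minimal sketch, assuming the panel's judgments have already been consolidated into one item count and one expected proportion per cell; the function name, cell contents, and numbers are hypothetical.

```python
def ebel_cut(cells):
    """Return the Ebel passing standard as an expected raw score.

    cells maps (difficulty, relevance) to a tuple of
    (number of items in the cell, proportion MC examinees get correct).
    """
    return sum(n_items * proportion for n_items, proportion in cells.values())


# Hypothetical 30-item test, for illustration only
cells = {
    ("easy", "critical"): (10, 0.9),
    ("moderate", "critical"): (8, 0.7),
    ("hard", "critical"): (4, 0.5),
    ("easy", "moderately important"): (6, 0.8),
    ("hard", "moderately important"): (2, 0.4),
}
print(f"Ebel passing standard: {ebel_cut(cells):.1f} of 30 items")
```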

Nedelsky

• Judges determine which response options are unrealistic for each item • The probability of a guessed correct response is calculated • The sum of the probabilities is the passing standard
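For one judge, the Nedelsky calculation can be sketched as below. It assumes the MC examinee guesses at random among the options the judge did not eliminate, so the probability of a correct guess on an item is 1 divided by the number of remaining options; the function name and counts are hypothetical. A panel standard would average the result across judges.

```python
def nedelsky_cut(options_per_item, eliminated_per_item):
    """Sum the guessing probabilities across items for one judge.

    options_per_item: number of response options on each item.
    eliminated_per_item: number of options the judge marked as ones the
        MC examinee would rule out (never the correct answer).
    """
    total = 0.0
    for n_options, n_eliminated in zip(options_per_item, eliminated_per_item):
        remaining = n_options - n_eliminated
        if remaining < 1:
            raise ValueError("the correct answer must remain")
        total += 1 / remaining
    return total


# Hypothetical four-option items, for illustration only
options = [4, 4, 4, 4]
eliminated = [2, 3, 1, 0]
print(f"Nedelsky standard: {nedelsky_cut(options, eliminated):.2f} of 4 items")
```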

Jaeger

• Judges evaluate each item – Yes/No - “Should every entry-level practitioner answer this item correctly?” • Judges discuss ratings & make adjustments • Judges are provided the passing rate based on the standard & make adjustments • Passing standard is calculated by summing the number of “Yes” responses

Test-Centered Options

• What the ratings are based on – Should or would MC get this right • How ratings are made – Yes/No, Percentage • Relevance adjustments • Guessing adjustments • What kind of feedback is provided – Passing rate – Other judges ratings – Actual item difficulty

Using Multiple Methods of Standard Setting

Why use Multiple Methods?

• There is error in every standard setting • Allows policymakers to “decide” on the standard rather than science simply documenting the outcomes of a panel • Allows for the recovery of standard setting sessions that go awry • Involves more stakeholders

Adjustments

• Simple Stats – Calculate the confidence interval around the estimate • Beuk – Judges provide an expected passing score and an expected passing rate. Calculations are made that are based on the variability in these two estimates • De Gruijter – Similar to Beuk, judges also provide an estimate of the uncertainty of their judgments.

• Hofstee – Judges indicate the highest and lowest passing score and passing rate. These values are plotted along with the cumulative frequency distribution and the point of intersection is the passing standard
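The Hofstee intersection can be approximated numerically. The sketch below assumes the judges' values have already been averaged into one lowest and highest acceptable passing score and one lowest and highest acceptable passing rate, draws a straight line from (lowest score, lowest rate) to (highest score, highest rate), and picks the integer cut score where the observed passing rate is closest to that line. The function name, parameter names, and data are hypothetical, and if the line and the observed curve never cross inside the range, the result is only the nearest point.

```python
def hofstee_cut(scores, k_min, k_max, p_min, p_max):
    """Approximate the Hofstee passing standard on integer raw scores.

    k_min, k_max: lowest and highest acceptable passing scores.
    p_min, p_max: lowest and highest acceptable passing rates (0-1).
    """
    if k_min >= k_max or p_min >= p_max:
        raise ValueError("minimum values must be below maximum values")
    n = len(scores)
    best_cut, best_gap = None, None
    for cut in range(k_min, k_max + 1):
        # Observed passing rate if this cut score were adopted
        observed = sum(1 for s in scores if s >= cut) / n
        # Passing rate on the judges' line at this cut score
        on_line = p_min + (p_max - p_min) * (cut - k_min) / (k_max - k_min)
        gap = abs(observed - on_line)
        if best_gap is None or gap < best_gap:
            best_cut, best_gap = cut, gap
    return best_cut


# Hypothetical data: one examinee at each score from 10 to 29
scores = list(range(10, 30))
print(hofstee_cut(scores, k_min=15, k_max=25, p_min=0.4, p_max=0.8))
```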

Survey of Hiring Managers

• Ask hiring managers about the workforce – What percentage of certified persons do you believe to be minimally competent?

– Are your certified persons more competent than your uncertified persons?

• Expands the reach of your exam

Triangulating results

• Psychometrician should present the outcome of each method and the passing rate associated with each outcome – A range of possible values • Policymakers can use this information and “their professional experience” to set the actual passing standard

Wrap-Up

3 Vital Recommendations

• Have more judges at standard setting • Spend more time training your judges • With each standard setting ensure that you take the time to define minimal competency conceptually and don’t forget to document this definition.

Concluding Remarks

• Many people like to think of test makers as big bad people, which is obviously not true. Standard setting is one example of how inclusive the scientific process of test development can be. I encourage folks to make this process light and fun.

Thank you for paying attention!

Questions & Comments: [email protected]