Test Validation 101_7-6-2012

Test Validation 101
2012 NILG Conference
August 29: 2:45 p.m. - 3:45 p.m.
Presenters: Dan Biddle, Ph.D., and Heather Patchell, M.A.
Overview of Biddle Consulting Group, Inc.
Affirmative Action Plan (AAP) Consulting and Fulfillment
• Thousands of AAPs developed each year
• Audit and compliance assistance
• AutoAAP™ Enterprise software
HR Assessments
• AutoGOJA™ online job analysis system
• TVAP™ test validation & analysis program
• CritiCall™ pre-employment testing for 911 operators
• OPAC™ pre-employment testing for admin professionals
• Video Situational Assessments (General and Nursing)
EEO Litigation Consulting/Expert Witness Services
• 200+ cases in EEO/AA (both plaintiff and defense)
• Focus on disparate impact/validation cases
Compensation Analysis
• Proactive and litigation/enforcement pay equity studies
• COMPare™ compensation analysis software
Publications/Books
• EEO Insight™: Leading EEO Compliance Journal
• Adverse Impact (3rd ed.) / Compensation (1st ed.)
BCG Institute for Workforce Development
• 4,000+ members
• Free webinars, EEO resources/tools
Nation-Wide Speaking and Training
• Regular speakers on the national speaking circuit
Biddle Consulting Group Institute for
Workforce Development (BCGi)
• BCGi Standard Membership (free)
– Online community
– Monthly webinars on EEO compliance topics
– EEO Insight Journal (e-copy)
• BCGi Platinum Membership
– Fully interactive online community
– Includes validation/compensation analysis books
– EEO Tools including validation surveys and AI calculator
– EEO Insight Journal (e-copy and hardcopy)
– Members only webinars, training and much more…
www.BCGinstitute.org
Your Presenters Today…
• Dan Biddle, Ph.D., CEO
– Over 20 years experience in EEO/AA & Testing
– Experience in over 100 cases
– Author of Test Validation & Adverse Impact (3rd ed.)
– [email protected]
• Heather Patchell, M.A.
– EEO/AA Consultant
– Executive Director of BCGi
– Masters I/O Psychology
– [email protected]
Presentation Overview
• Our goal:
– Review “high level” validation criteria for four common assessment
devices
– Provide basic and practical steps for validating each
– Equip you with take-home tools for validation
– Provide convincing evidence that validation produces both
qualified applicants and defensible PPTs
• The assessment devices we’ll be covering include:
– Basic Qualification (BQ) screens
– Physical Ability Tests
– Interviews
– Written Tests
• Review the “Test Validation Checklist” for validating each
type of device
Adverse Impact: The Trigger for the
Validation Requirement
A Brief Review
Before looking at validation…when is
validation required?
• Whenever your “PPT” exhibits adverse impact
• Single Event: Adj-FET / Chi-Square p < .05
• Multiple Event: Mantel-Haenszel / MEEP p < .05
• Particular PPT
• Overall Selection Process

          Pass   Fail   Totals
Women       40     60      100
Men         60     40      100

Passing Odds of Women: 67%
Passing Odds of Men: 150%
Odds Ratio: 2.25
P = .006
SD = 2.747
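The odds figures in the example above follow directly from the pass/fail counts. A minimal Python sketch of that arithmetic is shown here. The function names are illustrative only, and the significance check uses a plain two-proportion z-test as a stand-in. It is not the adjusted Fisher exact test (Adj-FET) named on the slide, so its p-value should not be expected to match the slide's P = .006 exactly.

```python
import math

def passing_odds(passed, failed):
    """Odds of passing: number passing divided by number failing."""
    return passed / failed

def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """Unadjusted two-proportion z-test (an assumption, not the slide's Adj-FET)."""
    rate_a = pass_a / n_a
    rate_b = pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / se
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_tailed

# Counts from the slide: women 40 pass / 60 fail, men 60 pass / 40 fail
women_odds = passing_odds(40, 60)      # about 0.67 (67%)
men_odds = passing_odds(60, 40)        # 1.5 (150%)
ratio = men_odds / women_odds          # about 2.25

z, p = two_proportion_z(60, 100, 40, 100)
print(f"women odds={women_odds:.2f}, men odds={men_odds:.2f}, odds ratio={ratio:.2f}")
print(f"z={z:.3f}, two-tailed p={p:.4f}, significant at .05: {p < 0.05}")
```

Whatever test is used, the decision rule on the slide is the same: a two-tailed p-value below .05 is treated as adverse impact.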
Adverse Impact in Context
How selection processes are challenged . . .
• Practice, Procedure, or Test (PPT)
• Plaintiff Burden: Diff. in Rates?
– NO: END
– YES: go to Defense Burden
• Defense Burden: Is the PPT Valid?
– NO: Plaintiff Prevails
– YES: go to Plaintiff Burden
• Plaintiff Burden: Alternative Employment Practice?
– NO: Defendant Prevails
– YES: Plaintiff Prevails
**OFCCP Insight(s)**
1. The OFCCP (typically) uses overall adverse impact as a
“red-flag” to identify where/when to investigate further.
2. If there is overall adverse impact, the OFCCP will
investigate the PPTs in the selection process.
3. It is absolutely imperative that the employer have the data
and the ability to analyze the individual steps in the overall
process.
4. If the necessary data is not available to perform step
analyses, the OFCCP can make an “adverse inference” . . .
(i.e., they can infer impact because the employer did not
collect the data they are required to collect).
A Brief Overview of Validation
Before Discussing Particular Type of PPTs,
Let’s Review Validation in General
• What is validity?
– Legally… “job related for the position in question
and consistent with business necessity”
– Practically… in jury trials, the test must somehow
rationally connect with the job
– With the OFCCP and other FEAs, it must comply
with UGESP (see www.uniformguidelines.com)
– We’ll focus on just two validation methods:
o Content validation
o Criterion-related validation
Guides Related To Validation Techniques
• Uniform Guidelines (Key!!)
• Principles (SIOP)
• Joint Standards
• Court Precedence
Content Validation Process
[Diagram: Job Duties → Operationally defined KSAOs (plus Other KSAOs) → Selection Devices (e.g., application form, tests, interviews) → Content Valid!]
Criterion-related Validity
[Diagram: Job Requirements, Job Performance, Test Score]
Criterion-Related Study
[Scatterplot: Test Score (x-axis, 0 to 100) vs. Performance Measure (y-axis, 0 to 70)]
• x-axis: Score on a “Test”
• y-axis: Score on some “Criteria” (e.g., job performance, days missed work, etc.)
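A criterion-related study like the one plotted above comes down to correlating test scores with a criterion measure. The Python sketch below computes a Pearson correlation by hand. The data are invented purely for illustration (they are not the points from the slide), and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between paired test scores and criterion scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical incumbents: test score and a job performance rating
test_scores = [20, 35, 40, 55, 60, 75, 80, 90]
performance = [15, 30, 20, 40, 35, 55, 50, 65]

r = pearson_r(test_scores, performance)
print(f"validity coefficient r = {r:.2f}")
```

In a real study the coefficient would also need a significance test and an adequate sample size; this sketch only shows the correlation step.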
Basic Qualification Screens &
Validation Requirements
Validating Basic Qualification Screens
• What are BQs? Some examples…
– “Must be able to lift and carry XX pounds for YY feet”
– College degree in XX field
– Certificate in YY field
• Basic qualifications can:
– Save the employer’s money and personnel resources
– Reduce the size of the applicant pool
– Allow qualified applicants to rise to the top
– Reduce the amount of time it takes to fill job openings
– Show applicants that the employer is serious about job
standards
Questions to ask about your BQs…
• Is the BQ likely to:
– Save your employer’s money and personnel resources?
– Result in an actual benefit to the target positions?
– Have adverse impact?
– Be perceived as a form of intentional discrimination?
– Survive an OFCCP Review as:
o Noncomparative?
o Objective?
o “Job relevant” and/or “job related and consistent
with business necessity”?
Before Launching the BQ, Ask:
• Is the BQ likely to:
– Represent a true “minimum baseline” needed for the
first day on the job?
– Be clearly understood by applicants?
– Be uniformly applied to all applicants?
– Discriminate (distinguish between qualified and
unqualified applicants)?
– Allow an equal opportunity for all applicants to
demonstrate that they possess the required levels?
Two Really Important BQ Concepts!
Important Concept #1:
If BQs have Adverse Impact, they Need to be “Validated”
Important Concept #2:
“Validation” is a DIFFERENT STANDARD than the
“job relevant” BQ requirement in the IA Regulations
Validation sometimes requires a different development process than
what might be used to set up “job relevant” BQs under the IA Regulations
Review Standard for BQs Depends on
Whether they have Adverse Impact!
• Basic Qualification
• STANDARD 1: Int. App. Regs: Noncomp? Objective? “Job Relevant”?
– NO: Int. App. Regulation Violation
– YES: AND Adverse Impact?
• Adverse Impact?
– NO: Standard 2 review is not triggered
– YES: go to Standard 2
• STANDARD 2: Title VII (e.g., Guidelines, 14C6): “Job Related & Consistent with Bus. Necessity”?
– NO = Disp. Imp. Discrimination
– YES = Defensible
Clarification on the “Two Standards” Offered
in the IA Regulations
• “That standard [the Title VII standard] is applicable as a
defense where a disparate impact has already been
proven” (p. 58957).
• By including the “relevant to performance of the particular
position” standard in the final rule as a limitation on
qualifications that could qualify as 'basic qualifications,'
OFCCP intends to provide a reasonable limit on the nature
of the qualifications used only to define recordkeeping
obligations. OFCCP does not intend to define
recordkeeping obligations through a presumption that
every putative 'basic qualification' involves a disparate
impact.
• Of course, once it is established that a criterion caused a
disparate impact, the contractor has the burden of
justifying that the criterion is job related and consistent
with business necessity (p. 58957).
What Review Standards Apply to BQs?
OFCCP’s Definition of an Internet Applicant
There are no
record retention
obligations at this
stage
Records must be retained
for all job seekers during
the following steps in the
process.
Only job seekers who meet all 4 requirements will be analyzed in
your Personnel Transactions and Adverse Impact Analyses
BQ Development & Validation Survey
• Use this survey for validating BQs
• Each row should contain incrementally higher levels of the BQ
• See Biddle (2010) Test Validation & Adverse Impact book for details
Weight Handling BQs and Physical
Ability Tests
A Worked Example… Establishing Defensible
BQs for Weight Handling Requirements
• Common Weight Handling BQs:
– Must be able to lift up to 50 pounds daily.
– Must be able to lift/carry 20-30 pounds routinely for
an 8-hour shift.
– May be required to carry, push, pull, drag or hold up
to 50 pounds.
– Person must be in excellent physical condition; be
able to lift and carry 80 pounds; and be able to work
under adverse conditions.
• Best Example:
– Must be able to lift and carry 54 pound boxes 100-150
times/8-hour shift for 10-30 feet each carry.
When it Comes to Setting Weight Handling BQs
for Your Job Postings . . .
Honest and qualified applicants may self-select
out of your hiring process!
One Method for Developing Weight
Handling BQs
• Step 1: Meet with management staff and create a list of
the common items that are physically handled by
incumbents.
• Step 2: Obtain weights for each item.
• Step 3: Survey job experts regarding:
– the frequency with which they handle (i.e., push/pull,
lift/carry, etc.) the items, and
– how they handle the items (e.g., how far, how long,
etc.)
One Method for Developing Weight
Handling BQs (cont.)
• Step 4: Analyze the survey data:
– Remove “outliers” (using 1.65 SD rule) and/or raters
with low inter-rater reliability
– Establish “frequent” and “occasional” requirements
for various physical activities (push/pull, lift/carry,
and other physical requirements)
– Establish weight handling BQs for each position at a
level where at least 70% of job experts agree (e.g.,
“70% of job experts surveyed agreed that they must be
able to lift and carry at least 50 pounds 10 times a day
or less”)
– Final BQ should include weight, how handled (lift,
carry, push, pull, drag, roll), and duration
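Step 4 can be sketched in a few lines of Python. This is only one reading of the method, under stated assumptions: the expert ratings are invented, the function names are hypothetical, outliers are taken to be ratings more than 1.65 sample standard deviations from the mean, and an expert is counted as "agreeing" with a candidate weight when his or her own rating is at or above it. Other readings of the 70% rule are possible.

```python
from statistics import mean, stdev

def drop_outliers(ratings, sd_cut=1.65):
    """Keep ratings within sd_cut sample SDs of the mean (assumed 1.65 SD rule)."""
    m = mean(ratings)
    s = stdev(ratings)
    if s == 0:
        return list(ratings)
    return [r for r in ratings if abs(r - m) <= sd_cut * s]

def agreement_weight(ratings, share=0.70):
    """Highest rated weight that at least `share` of experts rated at or above."""
    best = None
    for w in sorted(set(ratings)):
        agree = sum(r >= w for r in ratings) / len(ratings)
        if agree >= share:
            best = w
    return best

# Hypothetical survey: pounds each job expert says must be lifted and carried
expert_ratings = [40, 45, 50, 50, 50, 55, 55, 60, 60, 120]

kept = drop_outliers(expert_ratings)   # the 120-pound rating is removed
bq_weight = agreement_weight(kept)     # weight with at least 70% agreement
print(f"retained ratings: {kept}")
print(f"weight handling BQ: {bq_weight} pounds")
```

The final BQ wording would still need the handling type and duration from the survey; the sketch only sets the weight.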
One Method for Developing Weight
Handling BQs (cont.)
• Questions:
– Why establish the BQ weight using “at least 70% of job experts
agreed on a weight of X”?
– Doesn’t that set the weight cutoff too high?
– Why not just use the average of their responses?
• Answers:
– After removing outliers, the dataset should represent opinions
from the “normal range” of job experts
– Using the 70% rule will help ensure that at least the majority of
job applicants should be able to handle that weight
– The 70% rule “trims” the highest 30% of the ratings, ensuring
that the benchmark is set at a reasonable level
– Using the average could possibly set the weight requirement at a
level that 50% of the job experts thought was too low
What about Jobs that have Rigorous and/or
Regular Weight Handling Requirements?
• Use a physical ability test!
– Key Point: BQ screens are only self-reports!
• Rigorous physical ability tests will typically have
adverse impact on women . . . therefore:
– They must be validated!
– Don’t rely on “abstract strength tests” or “body
measurement methods” without statistical validity!
– Sometimes it’s better to measure physical abilities
using “work sample” tests
o This helps ensure that applicants can perform the actual
job, not just the “inferred” job requirements
o Applicant perception of fairness is the first trigger for
lawsuits!
Validating Interviews
Interviews and the Courts
• The question is still sometimes asked…
– “Are interviews really tests?”
– Yes, they are really tests
• Any Practice, Procedure, or Test (PPT) that
separates two groups (e.g., men/women) based
on two possible outcomes (e.g., pass/fail) is
classified as a “test” under the Uniform
Guidelines.
Interview Defensibility & Validity:
Some General Characteristics…
Least Defensible               Most Defensible
Unstructured                   Structured
Single Rater                   Multiple Raters
Generic “one size fits all”    Job Specific
Open Scoring/No Scoring        BARS
Low Validity                   High Validity
Unstructured: r = .11 - .18    Structured: r = .24 - .34
Litigation Involving Interviews
• Is there a connection between Interview type and success in court?
• Williamson et al. (1997). Employment interview on trial: Linking
interview structure with litigation outcomes. Journal of Applied Psychology,
82 (6), 900-912.
– Study involving 84 disparate treatment and 46 disparate impact
cases where interviews were litigated
– 17 interview characteristics were evaluated (e.g., objective,
subjective, standardized, etc.).
– Study resulted in clear findings that revealed the three primary
ingredients for successful interview validity defense
Key Interview Defensibility Characteristics
• The Three Primary Factors Are…
– Interview objectivity and job relatedness, such as:
o Objective and specified criteria
o Trained interviewers
o Validation evidence
– Standardized administration, including:
o Scoring guidelines
o Minimal rater discretion
o Common questions
o Consistency
– Multiple Interviewers
o Implies a shared decision making process
o Rater reliability
Interview Rating Systems
• Rating scale: avoid 3-point; use 7- or 9-point
• Benchmark answers
• Compare responses to benchmarks
• Consider more points for certain questions
Rating Errors
• Halo/Horn Effect
• Leniency/Severity/Central Tendency
• Contrast Effect
• Biases and Stereotyping
• Fatigue
Using a Panel of Assessors
• Essential investment
• Staff and stakeholder morale
• Diverse perspectives
• Increased defensibility
• Shared responsibility in decision-making
Validating Some Common Written
Tests
Types of Written Tests
• Skill / Ability Tests
– Can typically be content validated
– Examples include:
o Math
o Reading Comprehension
o Language Arts
• Job Knowledge Tests
– Almost always content validated
– Examples include:
o Promotional movements
o Licensure / Certification
• Cognitive Ability / Personality
– Typically Require Criterion-Related Validity
Some Factors to Consider for Any Type of
Written Test…
• Are we measuring KSAs that are needed on the first
day of the job?
• If the test is based on content validity, are the KSAs
operationally defined?
• Do we have a job analysis that can be linked to the test?
• What is the reliability of the test?
• How will the scores be used?
• Does our use of test scores exhibit adverse impact?
• If so, do we have a validation report that addresses 15B
(criterion) or 15C (content) of the Guidelines?
– For commercially available tests, have we conducted a local
validation study, or a 7B transportability study?
“Using” Test Scores in a
Valid/Defensible Manner
How You Plan To Use Your Test Is Critical!
• Pass/Fail Cutoffs:
– “Normal Expectations of Acceptable Proficiency in the Workplace”
(Guidelines, 5H)
– Modified Angoff (U.S. v. South Carolina, USSC)
• Banding:
– Substantially Equally Qualified Applicants
– Statistically Driven (use Std. Error of Difference)
• Ranking. For content validity:
– Is there adequate score dispersion?
– Does the test have high reliability?
– Is the KSA performance differentiating?
• Weighted/combined with other tests
– How are the weights related to the job?
– Do they come from the job analysis or SME ratings?
How Tests Can Be Used
Applicant   Score
Tom         100
Stacy       100
Bob         100
Frank       100
Julie       99
Rozanne     99
Mark        98
Luke        98
Henry       97
Paul        97
Peter       96
Rebecca     96
Alyssa      95
Matthew     94
John        93
Annette     93
Ray         92
Thomas      91
Julissa     90
• Ranking assumes one applicant is reliably more qualified
than the other
• Banding considers the unreliability of the test battery and
“ties” applicants
• Pass/fail cutoffs treat all applicants as either “qualified”
or “not qualified”
• Weighting/combining test scores can be done using
“compensatory” or using cutoff on each test then
weighting results
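To make the banding idea concrete, here is a rough Python sketch that builds a top band from the applicant scores listed above using the standard error of difference (SED). Several things are assumptions rather than anything stated in the presentation: the reliability value of .91 is hypothetical, the SED is computed with the common formulas SEM = SD × √(1 − reliability) and SED = SEM × √2, and the band width uses a 1.96 multiplier.

```python
import math
from statistics import stdev

applicants = [
    ("Tom", 100), ("Stacy", 100), ("Bob", 100), ("Frank", 100),
    ("Julie", 99), ("Rozanne", 99), ("Mark", 98), ("Luke", 98),
    ("Henry", 97), ("Paul", 97), ("Peter", 96), ("Rebecca", 96),
    ("Alyssa", 95), ("Matthew", 94), ("John", 93), ("Annette", 93),
    ("Ray", 92), ("Thomas", 91), ("Julissa", 90),
]

def band_width(sd, reliability, multiplier=1.96):
    """Band width from the standard error of difference (assumed formulas)."""
    sem = sd * math.sqrt(1 - reliability)   # standard error of measurement
    sed = sem * math.sqrt(2)                # standard error of difference
    return multiplier * sed

scores = [score for _, score in applicants]
width = band_width(stdev(scores), reliability=0.91)  # 0.91 is hypothetical

# Everyone within one band width of the top score is treated as "tied"
top_score = max(scores)
top_band = [name for name, score in applicants if score >= top_score - width]
print(f"band width = {width:.2f} points")
print(f"top band: {top_band}")
```

Under these assumptions, applicants scoring 98 to 100 fall in the same band, whereas strict ranking would separate them by single points.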
Characteristics of Pass/Fail Cut Scores
– NOT TYPICALLY DEFENSIBLE WHEN:
o Using an arbitrary cutoff (e.g., 70%)
o Using applicant scores to benchmark (e.g., setting cutoff scores
at mean-SD of applicant scores)
– TYPICALLY DEFENSIBLE WHEN:
o Consider “Normal expectations of acceptable proficiency in the
workplace” (Guidelines, 5H)
o Usually requires SME-level data or ratings
o Tied to job performance
– FACTORS TO CONSIDER:
o Is the test supported by content validity information or
criterion-related information?
o How critical are the KSAs measured?
o Does the test measure “baseline” or “differentiating” KSAs?
o How would current incumbents perform on this test?
Comparison Score Uses
Factor                     Ranking                    Banding                 Pass/Fail Cutoffs
Validation Requirements    High                       Moderate                Low
Adverse Impact             High                       Moderate                Low
Defensibility              Low                        High                    High
Litigation "Red Flag"      High                       Moderate                Low
Utility                    High                       Moderate                Low
Cost                       Low                        Moderate                High
Applicant Flow             Restrictive/Controllable   Moderate/Controllable   High
Development Time           Low                        Moderate                High
Reliability Requirements   High                       Moderate                Low
# Item Requirements        High                       Moderate                Low
Setting Validated Cutoff Scores Using the
“Modified Angoff” Method
                                  Item Number
Rater ID      1      2      3      4      5      6      7      8      9     10    Mean      SD
1           100     80     60    100     90    100     70     50    100     50      80   21.08
2            80     90     50     50     50     80     50     60     90    100      70   20.00
3            60     90    100     90     60     60     70     90     50     80      75   17.16
4           100     70     80    100    100     80     90     50     50     80      80   18.86
5            80     50     80     50     80     70     50     70     70     80      68   13.17
6            70     50     70     60    100     50     50     80     50    100      68   19.89
7           100     60     70     60    100     60     60     70     60     80      72   16.19
8            50     70     80     80     60     50     70     50     50     70      63   12.52
9            70     90     70     80     90     50     50     50     50     80      68   16.87
10           80     90     50    100     90     70     80     70     70     60      76   15.06
Mean         79     74     71     77     82     67     64     64     64     78      72   17.08
SD        17.29  16.47  15.24  20.58  18.74  16.36  14.30  14.30  18.38  15.49   16.71
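The summary figures in the Modified Angoff table can be reproduced from the raw ratings with a short Python sketch. The ratings below are the ten raters' values for the ten items as given above. Treating the overall mean as the unadjusted cutoff is an assumption of this sketch; any further adjustments (for example, for rater reliability or measurement error) are outside what it shows.

```python
from statistics import mean, stdev

# Rows are raters 1-10; columns are items 1-10 (percent of minimally
# qualified applicants each rater expects to answer the item correctly)
ratings = [
    [100, 80, 60, 100, 90, 100, 70, 50, 100, 50],
    [80, 90, 50, 50, 50, 80, 50, 60, 90, 100],
    [60, 90, 100, 90, 60, 60, 70, 90, 50, 80],
    [100, 70, 80, 100, 100, 80, 90, 50, 50, 80],
    [80, 50, 80, 50, 80, 70, 50, 70, 70, 80],
    [70, 50, 70, 60, 100, 50, 50, 80, 50, 100],
    [100, 60, 70, 60, 100, 60, 60, 70, 60, 80],
    [50, 70, 80, 80, 60, 50, 70, 50, 50, 70],
    [70, 90, 70, 80, 90, 50, 50, 50, 50, 80],
    [80, 90, 50, 100, 90, 70, 80, 70, 70, 60],
]

rater_means = [mean(row) for row in ratings]
rater_sds = [stdev(row) for row in ratings]          # sample SD per rater
item_means = [mean(col) for col in zip(*ratings)]
item_sds = [stdev(col) for col in zip(*ratings)]     # sample SD per item

# Unadjusted cutoff: the average rating across all raters and items
cutoff_percent = mean(rater_means)

print(f"rater means: {rater_means}")
print(f"item means:  {item_means}")
print(f"unadjusted cutoff: {cutoff_percent}%")
```

Large rater or item SDs are worth a second look before the cutoff is finalized, since they indicate that raters disagreed about how a minimally qualified applicant would perform.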
Test Validation Checklists
• Use these review checklists to determine the
validity of your PPTs under the requirements of
the Guidelines
Questions
Answers
Copyright © 2012 BCG, Inc.